oneflow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.


oneflow - Version 1.0.0

Published by levi131 7 months ago

OneFlow v1.0.0 release note

OneFlow v1.0.0 has been released. Welcome to install the new version for a better experience.

  • Highlights
  • New Features
  • Improvements
  • Changes and Fixes
  • Performance

Highlights

This version update includes 447 commits and the following highlights:

  • Released a new interface, compile_from_torch. It converts a PyTorch Module instance into a OneFlow Module instance while sharing the parameter memory, and supports direct Eager execution or conversion into a static graph nn.Graph, which can be further accelerated with MLIR compilation. The interface is evolving rapidly and currently supports dynamic shape compilation, validated on typical models such as ResNet50, Faster RCNN, and Stable Diffusion.

  • Made a series of optimizations and refactoring to Eager execution runtime, including unification of system memory pools, integration with CUDA native interfaces, optimization of instruction scheduling mechanisms, introduction of an instruction fusion mechanism, optimization of Autograd graph construction speed, optimization of Op inference process, and decoupling of Instruction and Stream, etc.

  • The static graph distributed physical execution plan supports separate compilation, allowing each process to independently compile its required execution plan, so compilation time no longer grows linearly with the number of GPUs.

  • Added a series of functional automatic differentiation interfaces, including jvp, vjp, hvp, vhp, jacobian, and hessian.

  • Added the Insight module, which visualizes kernel invocations, execution time, speed, and other information within annotated profiling intervals.

  • Updated LiBai (the open-source toolbox for large-scale model training), adding native support for fine-tuning and distributed inference of Llama2 and ChatGLM2, including full fine-tuning, adapter fine-tuning, and LoRA fine-tuning; lm-eval-harness can be used for language model evaluation and validation.

  • Upgraded OneFlow Serving, adding support for the OneFlow Python backend and the OneFlow Lite backend in addition to the existing OneFlow Cpp backend.

New Features

1. compile_from_torch

The compile_from_torch interface converts a PyTorch Module instance into a OneFlow Module instance while sharing the parameter memory. It supports direct Eager execution or conversion into a static graph nn.Graph, which can be further accelerated with MLIR compilation. (https://github.com/Oneflow-Inc/oneflow/pull/10404, https://github.com/Oneflow-Inc/oneflow/pull/10408, https://github.com/Oneflow-Inc/oneflow/pull/9984, https://github.com/Oneflow-Inc/oneflow/pull/9754)

Interface Signature and Parameter Introduction:

compile_from_torch(torch_module: torch.nn.Module, *, use_graph=True, options={})
* torch_module: The Torch Module instance to be converted.
* use_graph: Indicates whether to transform into a static graph nn.Graph and utilize MLIR compilation acceleration. The default is True.
* options:
  * size: When using the static graph nn.Graph, a hash of the graph corresponding to the input shape is computed and cached. size is the maximum capacity of the static graph cache; when the capacity is exceeded, graphs are evicted using an LRU strategy. The default value is 9.
  * dynamic: The first dynamic-shape input triggers a full graph compilation. For subsequent inputs with different shapes, if dynamic is True, a shared graph is used to accelerate compilation; if dynamic is False, each new shape is compiled from scratch. The default is True.
  * debug: Debug mode and log level settings. -1 disables debug mode; 0 outputs warnings and static graph construction information; 1 additionally outputs graph construction information for each sub-module; 2 additionally outputs progress for each operator; 3 provides more detailed operator information. The default value is -1.

Example of Usage:

import torch
from torchvision import models

import oneflow
from oneflow.framework.infer_compiler import compile_from_torch

DEVICE = torch.device("cuda")
WEIGHT = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=WEIGHT).to(DEVICE)
compile_model = compile_from_torch(model, options={"dynamic": True})
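
The other documented options can be passed the same way; a minimal sketch (the option values here are illustrative, not recommendations):

compile_model = compile_from_torch(
    model,
    use_graph=True,
    options={"size": 16, "dynamic": True, "debug": 0},
)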

2. Separate Compilation

The static graph distributed physical execution plan supports separate compilation, allowing each process to independently compile its required execution plan, so compilation time no longer grows linearly with the number of GPUs. Separate compilation supports 3D hybrid parallel (data parallelism + model parallelism + pipeline parallelism) scenarios and can be used together with LiBai (the open-source large-scale model training toolbox). To enable the feature, use: export ONEFLOW_ENABLE_LAZY_SEPARATE_COMPILE=1. (https://github.com/Oneflow-Inc/oneflow/pull/9920, https://github.com/Oneflow-Inc/oneflow/pull/10140, https://github.com/Oneflow-Inc/oneflow/pull/10141, https://github.com/Oneflow-Inc/oneflow/pull/10124, https://github.com/Oneflow-Inc/oneflow/pull/10102)
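
As a minimal sketch, the switch can also be set from Python, provided it is set before oneflow is imported and initialized:

import os

# Enable separate compilation of the distributed physical execution plan.
os.environ["ONEFLOW_ENABLE_LAZY_SEPARATE_COMPILE"] = "1"

import oneflow as flow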

Below are test results with LiBai on the GPT2 model, using 128 A100-PCIE-40GB GPUs:

| Parallelism | Separate Compilation Enabled | Execution Plan Compilation Time |
| --- | --- | --- |
| Data Parallelism (DP128 MP1 PP1) | No | Over 20 minutes |
| Data Parallelism (DP128 MP1 PP1) | Yes | 108.21 s |
| 3D Parallelism (DP4 MP4 PP8) | No | 445.16 s |
| 3D Parallelism (DP4 MP4 PP8) | Yes | 82.88 s |

3. Functional Automatic Differentiation Interfaces

A series of functional automatic differentiation-related interfaces have been introduced, including jvp, vjp, hvp, vhp, jacobian, and hessian. (https://github.com/Oneflow-Inc/oneflow/pull/10412, https://github.com/Oneflow-Inc/oneflow/pull/10428)

Example of Usage:

import oneflow as flow

# jacobian example
def exp_reducer(x):
    return x.exp().sum(dim=1)

input = flow.rand(2, 2)
jac_rslt = flow.autograd.functional.jacobian(exp_reducer, input)

# vhp example
def pow_reducer(x):
    return x.pow(3).sum()

input = flow.rand(2, 2)
v = flow.ones(2, 2)
vhp_rslt = flow.autograd.functional.vhp(pow_reducer, input, v)
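
The remaining interfaces follow the same calling convention; continuing the example above, a brief sketch (hvp and hessian are named in this release, with signatures assumed to mirror torch.autograd.functional):

# hvp example (reuses pow_reducer, input, and v from above)
hvp_rslt = flow.autograd.functional.hvp(pow_reducer, input, v)

# hessian example
hes_rslt = flow.autograd.functional.hessian(pow_reducer, input)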

4. Insight Module

Introduced a new Insight module, enabling visualization of kernel invocations, execution time, speed, and other information within annotated profiling intervals. (https://github.com/Oneflow-Inc/oneflow/pull/10370)

Usage:

  • Step 1: Set embedded point intervals in the code using the OneFlow Profiler module.
  • Step 2: Run the code and use NVIDIA Nsight Systems to generate a .sqlite file.
  • Step 3: Use the OneFlow Insight module to generate a .json file.
  • Step 4: Open the .json file in the browser at chrome://tracing/ or edge://tracing/ to obtain the visualization interface.

For more detailed information, please refer to: https://github.com/Oneflow-Inc/oneflow/tree/master/python/oneflow/utils/insight#usage
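
A hypothetical sketch of step 1 (the range-marking helper names below are assumptions, not confirmed API; consult the Insight README linked above for the actual usage):

import oneflow as flow
import oneflow.profiler

conv = flow.nn.Conv2d(3, 8, 3).to("cuda")
x = flow.randn(1, 3, 224, 224, device="cuda")

# Assumed NVTX-style helpers marking a profiling interval; see the README above.
oneflow.profiler.range_push("forward")
y = conv(x)
oneflow.profiler.range_pop()

Step 2 then runs the script under NVIDIA Nsight Systems (for example, nsys profile python train.py) and exports the report to a .sqlite file.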

5. LiBai Version Update

  • LiBai (the open-source toolbox for large-scale model training) has been upgraded to version v0.3.0. It now natively supports fine-tuning and distributed inference of the large language models Llama2 and ChatGLM2, including full fine-tuning, adapter fine-tuning, and LoRA fine-tuning; lm-eval-harness can be used for language model evaluation and validation.

  • Distributed training and inference support for ChatGLM and Llama2 is as follows:

Example of Usage:

# full finetune
bash tools/train.sh projects/Llama/train_net.py projects/Llama/configs/llama_sft.py 8
# adapter finetune
bash tools/train.sh projects/Llama/adapter/train_net.py projects/Llama/adapter/adapter_sft.py 8
# inference
bash tools/infer.sh projects/Llama/pipeline.py 8
# eval
python projects/Llama/utils/eval_adapter.py

6. Other New Features

Improvements

1. Eager Runtime Optimization and Refactoring

A series of optimizations and refactorings has been implemented for the Eager runtime, including unification of system memory pools, integration with CUDA native interfaces, optimization of the instruction scheduling mechanism, an instruction fusion mechanism, faster Autograd graph construction, an optimized Op inference process, and decoupling of Instruction and Stream.

Users can configure the Eager runtime through the following environment variables:

| Environment Variable | Meaning | Default Value |
| --- | --- | --- |
| ONEFLOW_VM_COMPUTE_ON_WORKER_THREAD | Whether to perform computation on worker threads | true |
| ONEFLOW_VM_MULTI_THREAD | Whether to use multi-threaded collaboration for Eager computation | true |
| ONEFLOW_VM_ENABLE_STREAM_WAIT | Whether to use the stream_wait mechanism for dependencies between multiple streams | true |
| ONEFLOW_VM_ENABLE_SCHEDULE_YIELD | Whether to use the yield mechanism to reduce the scheduler thread's busy waiting | true |
| ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE | Whether to cache operator output metadata during computation | true |
| ONEFLOW_VM_WORKER_THREAD_LIMIT | Number of worker threads | 16 |
| ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE | Maximum size of fused vm instructions | 10 |
| ONEFLOW_VM_BLOCKING_DEBUG_INSTRUCTIONS_DISPLAY_LIMIT | Number of unprocessed instructions printed when vm execution times out | 1000 |
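
For example, a minimal sketch (the variables must be set before oneflow is imported; the values here are illustrative, not recommendations):

import os

# Cap the number of Eager worker threads and disable the stream_wait mechanism.
os.environ["ONEFLOW_VM_WORKER_THREAD_LIMIT"] = "8"
os.environ["ONEFLOW_VM_ENABLE_STREAM_WAIT"] = "false"

import oneflow as flow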

2. Upgrade of OneFlow Serving Features

OneFlow Serving has been upgraded to support additional backends: the OneFlow Python backend and the OneFlow Lite backend, in addition to the existing OneFlow Cpp backend.

  • The OneFlow Cpp backend enables deployment in a Python-independent environment to achieve the highest performance.
  • The OneFlow Lite backend enables deployment on edge devices.
  • The OneFlow Python backend facilitates the deployment of complex models with minimal migration cost.

For usage instructions, refer to: https://github.com/Oneflow-Inc/serving/blob/main/README.md

3. Other Functionality Improvements

Changes and Fixes

1. Functional Changes

2. Bug Fixes

Performance

1. OneFlow compile_from_torch VS PyTorch compile

The backbone parts of the ResNet50 and Faster RCNN models were compiled and executed with the OneFlow compile_from_torch and PyTorch compile interfaces to test inference performance with inputs of different shapes. The results are shown in the table below:

| Model | Input shape | PyTorch compile | OneFlow compile_from_torch | dynamic | Test timing |
| --- | --- | --- | --- | --- | --- |
| ResNet50 | (1, 3, 512, 512) | 21.328 s | 3.205 s | False | initial compilation and execution |
| ResNet50 | (2, 3, 896, 512) | 14.167 s | 1.523 s | False | continuous compilation and execution |
| ResNet50 | (2, 3, 512, 896) | 13.364 s | 1.402 s | False | continuous compilation and execution |
| ResNet50 | (3, 3, 896, 896) | 15.056 s | 1.539 s | False | continuous compilation and execution |
| ResNet50 | (2, 3, 1024, 896) | 14.167 s | 1.500 s | False | continuous compilation and execution |
| ResNet50 | (2, 3, 896, 1024) | 12.891 s | 1.494 s | False | continuous compilation and execution |
| ResNet50 | (6, 3, 1024, 1024) | 14.859 s | 1.872 s | False | continuous compilation and execution |
| ResNet50 | (1, 3, 512, 512) | 170.446 s | 3.143 s | True | initial compilation and execution |
| ResNet50 | (2, 3, 896, 512) | 185.672 s | 0.851 s | True | continuous compilation and execution |
| ResNet50 | (2, 3, 512, 896) | 0.089 s | 0.836 s | True | continuous compilation and execution |
| ResNet50 | (3, 3, 896, 896) | 0.084 s | 0.980 s | True | continuous compilation and execution |
| ResNet50 | (2, 3, 1024, 896) | 0.077 s | 0.942 s | True | continuous compilation and execution |
| ResNet50 | (2, 3, 896, 1024) | 0.080 s | 0.931 s | True | continuous compilation and execution |
| ResNet50 | (6, 3, 1024, 1024) | 0.084 s | 1.406 s | True | continuous compilation and execution |
| Faster RCNN | (1, 3, 512, 512) | 18.224 s | 5.483 s | False | initial compilation and execution |
| Faster RCNN | (2, 3, 896, 512) | 9.200 s | 3.011 s | False | continuous compilation and execution |
| Faster RCNN | (2, 3, 512, 896) | 9.331 s | 3.025 s | False | continuous compilation and execution |
| Faster RCNN | (3, 3, 896, 896) | 9.301 s | 2.854 s | False | continuous compilation and execution |
| Faster RCNN | (2, 3, 1024, 896) | 9.290 s | 2.805 s | False | continuous compilation and execution |
| Faster RCNN | (2, 3, 896, 1024) | 9.123 s | 2.851 s | False | continuous compilation and execution |
| Faster RCNN | (6, 3, 1024, 1024) | 9.377 s | 3.180 s | False | continuous compilation and execution |
| Faster RCNN | (1, 3, 512, 512) | 25.444 s | 5.430 s | True | initial compilation and execution |
| Faster RCNN | (2, 3, 896, 512) | 25.381 s | 1.899 s | True | continuous compilation and execution |
| Faster RCNN | (2, 3, 512, 896) | 0.116 s | 1.886 s | True | continuous compilation and execution |
| Faster RCNN | (3, 3, 896, 896) | 1.982 s | 1.793 s | True | continuous compilation and execution |
| Faster RCNN | (2, 3, 1024, 896) | 0.114 s | 1.803 s | True | continuous compilation and execution |
| Faster RCNN | (2, 3, 896, 1024) | 0.111 s | 1.778 s | True | continuous compilation and execution |
| Faster RCNN | (6, 3, 1024, 1024) | 0.143 s | 2.110 s | True | continuous compilation and execution |

The unet part of the Stable Diffusion model was compiled and executed with the OneFlow compile_from_torch and PyTorch compile interfaces to test inference performance with outputs of different shapes. The results are shown in the table below:

| Model | Output shape | PyTorch compile | OneFlow compile_from_torch | dynamic | Test timing |
| --- | --- | --- | --- | --- | --- |
| Stable Diffusion | (2, 512, 512) | 103.701 s | 63.670 s | False | initial compilation and execution |
| Stable Diffusion | (1, 512, 768) | 95.137 s | 53.864 s | False | continuous compilation and execution |
| Stable Diffusion | (2, 768, 512) | 90.259 s | 55.271 s | False | continuous compilation and execution |
| Stable Diffusion | (1, 768, 768) | 90.196 s | 51.590 s | False | continuous compilation and execution |
| Stable Diffusion | (2, 512, 512) | 275.660 s | 57.117 s | True | initial compilation and execution |
| Stable Diffusion | (1, 512, 768) | 345.774 s | 43.752 s | True | continuous compilation and execution |
| Stable Diffusion | (2, 768, 512) | 349.835 s | 47.653 s | True | continuous compilation and execution |
| Stable Diffusion | (1, 768, 768) | 7.224 s | 45.720 s | True | continuous compilation and execution |
| Stable Diffusion | (2, 512, 512) | 4.088 s | 2.831 s | False | subsequent execution |
| Stable Diffusion | (1, 512, 768) | 3.296 s | 2.325 s | False | subsequent execution |
| Stable Diffusion | (2, 768, 512) | 5.594 s | 5.157 s | False | subsequent execution |
| Stable Diffusion | (1, 768, 768) | 4.713 s | 3.557 s | False | subsequent execution |
| Stable Diffusion | (2, 512, 512) | 4.448 s | 2.801 s | True | subsequent execution |
| Stable Diffusion | (1, 512, 768) | 3.201 s | 2.314 s | True | subsequent execution |
| Stable Diffusion | (2, 768, 512) | 6.093 s | 4.166 s | True | subsequent execution |
| Stable Diffusion | (1, 768, 768) | 4.920 s | 3.557 s | True | subsequent execution |

Conclusion: The OneFlow compile_from_torch interface generally compiles faster than the PyTorch compile interface. In addition, thanks to deeply optimized operators in the OneFlow framework, execution performance on the Stable Diffusion model is superior.

Note: The tests were conducted on a 3090 GPU with PyTorch v2.1.2 and CUDA 12.2.

2. OneFlow Eager vs PyTorch Eager

| Model | GPU model | Number of GPUs | Macro batch | PyTorch performance (iter/s) | OneFlow performance (iter/s) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet50 | 3090 | 1 | 1 | 31.37 | 38.81 | 23.72% |
| ResNet50 | 3090 | 1 | 2 | 32.06 | 48.45 | 51.12% |
| ResNet50 | 3090 | 2 | 1 | 31.10 | 33.46 | 7.59% |
| ResNet50 | 3090 | 2 | 2 | 31.76 | 34.83 | 9.67% |
| ResNet50 | A100 | 1 | 1 | 24.60 | 46.64 | 89.59% |
| ResNet50 | A100 | 1 | 2 | 25.06 | 49.88 | 99.04% |
| ResNet50 | A100 | 2 | 1 | 25.28 | 39.18 | 54.98% |
| ResNet50 | A100 | 2 | 2 | 24.09 | 32.84 | 36.32% |
| Bert | 3090 | 1 | 1 | 8.93 | 10.41 | 16.57% |
| Bert | 3090 | 1 | 2 | 13.11 | 14.31 | 9.15% |
| Bert | 3090 | 2 | 1 | 6.94 | 8.27 | 19.16% |
| Bert | 3090 | 2 | 2 | 12.19 | 15.58 | 27.81% |
| Bert | A100 | 1 | 1 | 10.45 | 12.72 | 21.72% |
| Bert | A100 | 1 | 2 | 20.24 | 21.57 | 6.57% |
| Bert | A100 | 2 | 1 | 12.63 | 16.09 | 27.39% |
| Bert | A100 | 2 | 2 | 24.86 | 29.84 | 20.03% |

Conclusion: Compared with PyTorch Eager, OneFlow Eager shows significant performance advantages in small-batch scenarios for both the ResNet50 and BERT models.

Note: The tests were conducted using PyTorch v2.1.0 and CUDA 12.1.


oneflow - Version 0.9.0

Published by jackalcooper almost 2 years ago

OneFlow v0.9.0 release note

OneFlow v0.9.0 has been released. Welcome to install the new version for a better experience.

  • Highlights
  • Backwards Incompatible Change
  • New Features
  • Performance
  • Improvements
  • Bug fixes
  • Documentation
  • Edge Tools

Highlights

This update contains 640 commits and the following highlights:

  • With 86 new API interfaces and operators aligned with PyTorch and 104 fixes for operator compatibility bugs, OneFlow v0.9.0 provides better PyTorch API and model compatibility. In v0.9.0, users can migrate more PyTorch models to OneFlow with one click and gain faster performance.

    • Allows one-click migration of Stable Diffusion, GLM, YOLOv5, etc. to OneFlow.

    • More convenient model migration: oneflow.load supports directly loading models saved with torch.save.

    • With the newly added oneflow.mock_torch module and mock method, OneFlow can migrate complex PyTorch models containing multiple scripts with one click, without changing the original PyTorch scripts (see the sketch after this list).

  • Global Tensor has added a series of interfaces and methods that are convenient for distributed programming, and fixed known related bugs.

  • Graph released automatic parallelization (version 1), which supports automatically searching for the fastest SBP under a specified Placement. When writing distributed models with Global Tensor, users do not need to consider the parallel strategy.

  • Graph adds a series of optimizations related to memory, execution speed, pipeline overlapping, and compilation speed, improving performance and reducing memory overhead.

  • Graph provides a series of debugging aids, including memory-analysis logs, progress display during the compilation stage, and printing of the computation graph.

  • OneFlow IR provides more compilation optimization functions.

  • OneFlow's error messages are more user-friendly: the error content is highlighted and unnecessary internal details are simplified, so you can visually identify the location and type of an error.

  • A series of operator and system optimizations have been added, including Eager instruction scheduling, high-performance CUDA kernels, unified memory pools, etc.
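
A minimal sketch of the mock method mentioned in the highlights above (assuming oneflow.mock_torch.enable() is usable as a context manager):

from oneflow import mock_torch

# Inside the mock scope, `import torch` resolves to oneflow, so unmodified
# PyTorch code can run on OneFlow.
with mock_torch.enable():
    import torch

    x = torch.ones(2, 3)  # actually an oneflow tensor under the mock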

Backwards Incompatible Change

  • To resolve a potential name conflict between Graph.Block.config and user-defined module attributes named module.config, OneFlow redesigned the abstraction of the Graph proxy Module/Tensor, introducing a breaking change: (https://github.com/Oneflow-Inc/oneflow/pull/9351, https://github.com/Oneflow-Inc/oneflow/pull/9437, https://github.com/Oneflow-Inc/oneflow/pull/9607)

    • The attr and config attributes on Block are removed, and Block is renamed to Proxy;

    • Implementation: when added as members of nn.Graph, the original Eager Module and Tensor types are wrapped into the Proxy class, and the corresponding GraphModule and GraphTensor are generated. nn.Graph then uses the Proxy for subsequent graph composition and proxy execution; when the proxy executes, both the original Eager type and the Graph type can be obtained from it. The naming follows torch.fx.

    | | Eager primitive type | Graph type (base class GraphBlock) | Proxy execution type (base class Proxy) |
    | --- | --- | --- | --- |
    | Function | Supports getting the original Eager type | A graph code block corresponding to the Eager type; GraphBlock stores the information required for graph execution, such as name/scope, lazy op or tensor, and graph optimization switches of sub-modules | Proxy execution capability, with the same execution interface as Module and Tensor but changed (lazy) behavior; the executed ops may also be rewritten |
    | Module type | Module | GraphModule | ProxyModule, containing a Module member and a GraphModule member |
    | Tensor type | Tensor | GraphTensor | ProxyTensor, containing a Tensor member and a GraphTensor member |
    • Here is an example:

    import oneflow as flow
    import oneflow.nn as nn
    from oneflow.nn.graph import GraphModule

    linear = flow.nn.Linear(3, 8, False)

    class LinearGraph(nn.Graph):
        def __init__(self):
            super().__init__()
            # linear is an nn.Module. When added as an attribute of nn.Graph,
            # it is registered with nn.Graph:
            #   self.linear is wrapped as a ProxyModule;
            #   self.linear.weight is wrapped as a ProxyTensor;
            #   nn.Graph uses the ProxyModule to perform graph composition.
            self.linear = linear
            # A ProxyModule has two parts: the original Module and a GraphModule.
            # self.linear.to(GraphModule) returns the corresponding GraphModule, on
            # which graph-optimization configuration can be done, e.g. setting a
            # pipeline stage for a module to enable pipeline parallelism:
            self.linear.to(GraphModule).set_stage(id, placement)
            self.linear.to(nn.Module)           # get the corresponding original nn.Module
            self.linear.weight.to(flow.Tensor)  # get the corresponding original Tensor

Outdated interface in OneFlow v0.8.0:

import oneflow as flow
import oneflow.nn as nn
linear = flow.nn.Linear(3, 8, False)
class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = linear
        self.linear.config.set_stage(id, placement)  # set stage
        self.linear.config.activation_checkpointing = True  # set activation checkpointing
        self.linear.origin  # get the corresponding original nn.Module
        self.linear.weight.origin # get the corresponding original Tensor

New interface in OneFlow v0.9.0:

import oneflow as flow
import oneflow.nn as nn
from oneflow.nn.graph import GraphModule
linear = flow.nn.Linear(3, 8, False)
class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = linear
        self.linear.to(GraphModule).set_stage(id, placement)  # set stage
        self.linear.to(GraphModule).activation_checkpointing = True  # set activation checkpointing
        self.linear.to(nn.Module)  # get the corresponding original nn.Module
        self.linear.weight.to(flow.Tensor)  # get the corresponding original Tensor

New Features

Graph

  • Adds automatic parallelization feature for the first stage in Graph: (https://github.com/Oneflow-Inc/oneflow/pull/8891, https://github.com/Oneflow-Inc/oneflow/pull/9172 , https://github.com/Oneflow-Inc/oneflow/pull/9288)

    • Automatic parallelism can be enabled by configuring self.config.enable_auto_parallel(True) in Graph. After it is enabled, you don't have to configure sbp, and the Graph will automatically find the optimal sbp combination.

    • Here is an example:

    import oneflow as flow
    class SubclassGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__() # MUST be called
            # auto parallelism configuration
            self.config.enable_auto_parallel(True)
            # other configurations about auto parallelism
            # ......
    
        def build(self):
            pass
    
  • Graph supports a straighten-algorithm optimization with memory priority, which shortens each Tensor's memory lifetime by adjusting the execution order, reducing peak memory usage (see the sketch below). (https://github.com/Oneflow-Inc/oneflow/pull/9094)

    • With self.config.enable_straighten_algorithm("MemoryFirst"), the straightened algorithm with memory optimization can be enabled.

    • The available modes are as follows: "MemoryFirst" / "SpeedFirst" / "Disable" / "OverlapCpuGpu"

    • At the same time, Graph adds an "OverlapCpuGpu" mode that overlaps CPU and GPU kernels with each other as much as possible. (https://github.com/Oneflow-Inc/oneflow/pull/9278)
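
    A minimal sketch of enabling the straighten algorithm inside a Graph (using the configuration call documented above):

    import oneflow as flow

    class MemoryFirstGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__()
            # Enable the memory-priority straighten algorithm; the other modes
            # are "SpeedFirst", "Disable", and "OverlapCpuGpu".
            self.config.enable_straighten_algorithm("MemoryFirst")

        def build(self, x):
            return x  # placeholder body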

  • Graph provides generalized basic transmission, using nccl send/recv to implement fast communication for any NdSbp (2d, 3d, ...), minimizing the transmission volume. (https://github.com/Oneflow-Inc/oneflow/pull/8437, https://github.com/Oneflow-Inc/oneflow/pull/8783)

  • With autograd.Function, custom ops can now be used in Graph (https://github.com/Oneflow-Inc/oneflow/pull/8843).

  • The Graph optimizer supports param_group["lr_scale"], allowing the learning rate to be configured for the parameters of each module/layer. (https://github.com/Oneflow-Inc/oneflow/pull/9138)

  • Adds the enable_multi_tensor_update optimization. Enabled with self.config.enable_multi_tensor_update(True), it reduces the overhead of updating many small, fragmented parameters; see the sketch after the next item. (https://github.com/Oneflow-Inc/oneflow/pull/9209, https://github.com/Oneflow-Inc/oneflow/pull/9252)

  • Adds the enable_fused_model_update_cast optimization. Enabled with self.config.enable_fused_model_update_cast(True), it speeds up training by fusing the optimizer update with the fp16 cast when AMP is on. (https://github.com/Oneflow-Inc/oneflow/pull/9209)
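
    A brief sketch enabling both switches inside a Graph (the calls as documented in the two items above; the graph body is a placeholder):

    import oneflow as flow

    class TrainGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__()
            # Batch many small parameter updates into fewer kernels.
            self.config.enable_multi_tensor_update(True)
            # Fuse the optimizer update with the fp16 cast under AMP.
            self.config.enable_fused_model_update_cast(True)

        def build(self, x):
            return x  # placeholder body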

  • Graph supports non-uniform segmentation under ND-SBP. (https://github.com/Oneflow-Inc/oneflow/pull/9310)

  • Graph supports LazyTensor's indexing feature.
    (https://github.com/Oneflow-Inc/oneflow/pull/9334)

  • Adds the enable_compress_memory interface. Enabled with self.config.enable_compress_memory(True), it tries to optimize memory by iterating over the computation graph's GPU memory allocation for up to about half an hour, converging on a minimum close to the lower bound. (https://github.com/Oneflow-Inc/oneflow/pull/9509)

  • Adds oneflow.utils.global_view.global_mode, which supports smooth migration from single-GPU code to multi-GPU code. global_mode creates a global context that can be switched on and off, sets the default placement and sbp under the context, and supports LocalTensor syntax such as Tensor.device and Tensor.to(device). Source ops created in this context automatically produce GlobalTensors populated with the default placement and sbp, so the local-tensor logic inside a module can be converted to global logic in a non-invasive manner.

    • Here is an example:

    import oneflow as flow
    from oneflow.utils.global_view import global_mode

    P_C = flow.placement("cpu", ranks=[0, 1])
    P = flow.placement("cuda", ranks=[0, 1])
    B = flow.sbp.broadcast
    S0 = flow.sbp.split(0)

    # A data-parallel linear layer (defined here so the example is self-contained).
    linear_dp = flow.nn.Linear(8, 8, False).to_global(placement=P, sbp=B)

    x = flow.ones((6, 8), placement=P_C, sbp=S0)

    with global_mode(True, placement=P, sbp=B):
        device = linear_dp.weight.device
        x = x.to(device)  # move the global tensor to the device
        out = linear_dp(x)

        # The local tensor will be converted to global.
        sample = flow.randn(out.shape, device="cpu").to(device)

Debug

Eager

oneflow - Version 0.8.0

Published by jackalcooper over 2 years ago

OneFlow v0.8.0 Release Note

OneFlow v0.8.0 has been released. Welcome to install the new version for a better experience.

  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Performance
  • Improvements
  • Bug fixes
  • Documentation

Highlights

This update contains 523 commits and the following highlights:

  • PyTorch-compatible APIs have been further optimized: 68 new APIs aligned with PyTorch have been added, and 84 operator and interface compatibility bugs have been fixed. More PyTorch models can be migrated to OneFlow with one click.

  • All operators support Global Tensor more completely and efficiently: 28 Global Tensor-related bugs have been fixed, and 180 operator unit tests have been added.

  • Graph's advanced features have been further optimized:

    • In addition to the existing ZeRO-DP, the Zero Redundancy Optimizer (ZeRO) can now be combined with MP parallelism, 2D parallelism, and 3D parallelism, saving more memory.

    • Graph provides a new pipeline parallelism API, which not only simplifies pipeline configuration but also improves the performance of pipeline parallelism and 3D parallelism.

    • Multi-dimensional debugging functionality has been added for the logic graph, the light plan physical graph, memory analysis, Python stack information, and more, making Graph.debug more efficient.

  • Empowered by OneFlow v0.8.0 and LiBai v0.2.0, 3D-parallel training of GPT and BERT is notably faster, exceeding Megatron-LM's training speed under the same configuration in multiple dimensions. For more details, please click here.

  • OneEmbedding has been released. It is an extension component designed for large-scale recommendation systems, offering high efficiency, extensibility, flexibility, and other advantages.

  • Multi-device adaptation: OneFlow v0.8.0 provides a neat, efficient, and easily extensible hardware abstraction layer called EP (Execution Provider) and defines a collection of basic computing interfaces called Primitive, allowing kernels to be re-implemented on top of the Primitive interface.

  • Added new debugging tool stacks: OneFlow-Profiler and AutoProf

    • OneFlow-Profiler is a tool designed to collect performance information during framework execution. It can record the execution time of operators and system components, the allocation of device and host memory, and the corresponding inputs and parameters of operators. This information helps developers find the main sources of overhead in framework execution and implement targeted optimizations.

    • AutoProf is a framework designed to efficiently detect the alignment between OneFlow APIs and PyTorch APIs. Besides, it can automatically compare the performance results of OneFlow APIs and PyTorch APIs.

  • Significantly optimized the exception handling process in the OneFlow API and improved the error messages raised when APIs encounter exceptions.

  • Significantly optimized the OneFlow API documentation: the API documentation has been restructured based on functionality. In addition to general operator APIs, oneflow.nn.graph, oneflow.embedding, oneflow.autograd and other modules in OneFlow and their environment variables have also been explained in detail.

Backwards Incompatible Change

Outdated configuration method in OneFlow v0.7.0:

import oneflow as flow

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.set_zero_redundancy_optimizer_mode("distributed_split")
        if zero_stage > 1:
            # stage 2
            flow.boxing.nccl.enable_use_compute_stream(True)
            if zero_stage > 2:
                # stage 3
                flow.boxing.nccl.disable_group_boxing_by_dst_parallel(True)
    def build(self, x):
        return self.linear(x)

graph = Graph()

New interface in OneFlow v0.8.0:

import oneflow as flow

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.enable_zero(stage=2)
    def build(self, x):
        return self.linear(x)

graph = Graph()

Deprecations

Python API

v0.7.0

oneflow.sbp.split(axis=0)

v0.8.0

oneflow.sbp.split(dim=0)

  • For the outdated pipeline parallelism configuration method self.module_layer_0.config.stage_id = 0 (not recommended), a new pipeline parallelism API config.set_stage has been added, which improves pipeline parallelism performance and avoids calling input_tensor.to_global(placement=this_stage_placement) for all module input tensors at every stage. (https://github.com/Oneflow-Inc/oneflow/pull/8442)

v0.7.0

import oneflow as flow

B = [flow.sbp.broadcast]
P_0 = flow.placement(type = "cuda", ranks = [0, 1])
P_1 = flow.placement(type = "cuda", ranks = [2, 3])

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # Set different module's stage id to hint the graph preparing right num of buffers in pipeline.
        self.m_stage0.config.stage_id = 0 
        self.m_stage1.config.stage_id = 1
        self.config.set_gradient_accumulation_steps(4)        

    def build(self, x):
        x = x.to_global(placement=P_0, sbp=B)
        y = self.m_stage0(x)
        # Move tensor between different pipeline stages.
        y = y.to_global(placement=P_1, sbp=B)
        z = self.m_stage1(y)
        return z

v0.8.0

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # set_stage(stage_id, placement)
        # The Stage ID is numbered starting from 0 and increasing by 1.
        # The Placement is all tensors placement of this module.
        self.m_stage0.config.set_stage(stage_id=0, placement=P_0)
        self.m_stage1.config.set_stage(stage_id=1, placement=P_1)
        self.config.set_gradient_accumulation_steps(4)
    
    def build(self, x):
        # tensor.to_global(placement) is applied automatically to every input
        # tensor of this module, so there is no need to call to_global() in or
        # outside the module's forward function.
        y = self.m_stage0(x)
        z = self.m_stage1(y)
        return z

New Features

Graph

Debug

  • Graph.debug offers a new parameter, max_stack_depth (default: 2), to record the maximum depth of the Python stack for each op in Graph, making it convenient to locate the Python context of each op. (https://github.com/Oneflow-Inc/oneflow/pull/8028)

  • Apart from printing the input/output/variable info of modules in Graph, printing the info of operators constructed in module forward is now also supported. (https://github.com/Oneflow-Inc/oneflow/pull/8135)

  • Enabled export ONEFLOW_DEBUG_MODE=true and export GLOG_v=3 to print the full memory log, which contains multi-level MemBlock info on each device (Total Memory -> Chunk -> MemBlock), Blocks with exclusive memory, Eager Variables, and other information. A lifecycle label was also added to Regst to analyze each tensor's memory lifecycle.

  • LightPlan provides a more simplified way to display the Actor graph, reducing the cost of Plan-based debugging. When ONEFLOW_DEBUG_MODE=true, a series of light plan files corresponding to each rank of a Graph is generated under the log/local_rank_0/machine/ directory, each containing the simplified actor sub-graph of one rank; the filename is GraphName_rank_i_light_plan. (https://github.com/Oneflow-Inc/oneflow/pull/8396)

  • The print(graph) method can display the logic graph by Module, making debugging during graph construction more efficient. (https://github.com/Oneflow-Inc/oneflow/pull/8131)

Eager

Tensor

Global Boxing

OneEmbedding

For better recommendations, modern recommendation systems always rely on huge Embedding tables. Besides, frequent iterations of user data require model training to be fast enough.

OneEmbedding is a component designed for large-scale recommendation systems, and it's efficient, extensible, and highly flexible. The following are its advantages:

  1. Hierarchical storage and dynamic capacity expansion: users can expand the capacity of the Embedding at much lower cost.

  2. Mixed parallelism strategy: it supports easily extending the model to train it on multi-machine multi-GPU.

  3. Embedding quantization for better communication: in the parallel scenario, communication data can be quantized to reduce the communication amount, thus accelerating the training.

  4. Efficient data pipeline: the model parts that have no data dependency can be executed in advance, thus overlapping with other operations in time.

  5. Automatic mixed precision training: data can be computed in FP16 to reduce the occupied memory, thus accelerating the training speed and ensuring high model convergence precision.

  6. A collection of efficient CUDA ops for common operations in recommendation systems is available.

  7. Flexible model building is supported.

See the OneEmbedding API documentation here.

PyTorch Compatibility

A collection of new functionalities and interfaces that are compatible with PyTorch 1.10.0 have been added.

Tensor

Operators

Random

  • Added new interfaces: oneflow.cuda.manual_seed, oneflow.cuda.manual_seed_all, oneflow.seed, oneflow.manual_seed, oneflow.initial_seed, oneflow.get_rng_state, oneflow.set_rng_state and improved the configuration of OneFlow random seed initialization. (https://github.com/Oneflow-Inc/oneflow/pull/7957 )
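
A brief sketch using the interfaces listed above (assumed to mirror their PyTorch counterparts):

import oneflow as flow

flow.manual_seed(0)           # seed the global generator
flow.cuda.manual_seed_all(0)  # seed every CUDA device
state = flow.get_rng_state()  # snapshot the RNG state...
flow.set_rng_state(state)     # ...and restore it later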

AutoGrad

CUDA

RNN

  • Refactored the RNN modules and migrated the Python-layer splicing implementation to C++, greatly optimizing performance. Added RNNCell-related modules and modules aligned with torch.nn.utils.rnn in functionality (see the sketch after this list):

    • Refactored modules: RNN, LSTM, and GRU
    • Added modules: RNNCell, LSTMCell, GRUCell, and oneflow.nn.utils.rnn
    • Supported and fixed local and global RNN unit tests, and completed the documentation.
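
    A brief sketch of the refactored LSTM (signatures assumed to mirror torch.nn):

    import oneflow as flow

    # (input_size, hidden_size, num_layers), as in torch.nn.LSTM
    lstm = flow.nn.LSTM(10, 20, 2)
    x = flow.randn(5, 3, 10)  # (seq_len, batch, input_size)
    out, (h, c) = lstm(x)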

Device

Supported heterogeneous device types: to cope with the diversity of hardware, OneFlow, following the dependency inversion principle of software engineering, has introduced a hardware abstraction layer called Execution Provider (EP). The hardware abstraction layer is a set of interfaces abstracted from the capabilities that hardware devices must provide while the framework is running. Once the hardware abstraction layer is in place, each module calls the interfaces provided by the abstraction layer rather than the original hardware interfaces, so it no longer needs to care about the specific details of the underlying hardware. When a new hardware device is introduced, the abstraction interfaces stay unchanged, so all modules can adapt to the new device without any modification. Likewise, when adapting new hardware for the framework, there is no need to understand the framework's implementation details: it suffices to implement the series of interfaces according to the contract of the hardware abstraction layer and the actual capabilities of the device.

Execution Provider has defined a collection of runtime interfaces: device registration interface, device management interface, queue management interface, event management interface, and memory management interface.

Primitive

In addition to the runtime interfaces, the Execution Provider defines a set of computing interfaces called Primitive, which describe the computations commonly used in a deep learning framework and thereby simplify operator development during hardware adaptation. Compared with the runtime interfaces, the interfaces provided by Primitive are looser and more flexible: all interfaces are mutually independent, and each represents a specific computing capability of a hardware device. Like the runtime interfaces, the Primitive interfaces are abstracted close to the device side, so developers can carry out adaptation work without an in-depth understanding of OneFlow's internals. Developers must implement all interfaces of the Execution Provider when adapting the runtime interfaces, but they can adapt Primitive selectively according to the actual needs of their project.

Debug tools

OneFlow-Profiler

OneFlow-Profiler is designed to collect various performance-related information during the execution flow of the framework. It can measure the execution time of operators and system components, record the allocation of device and host memory, and record the inputs and parameters of operators. Developers can use this information to analyze which parts bring the most overhead and implement targeted optimizations.

Auto-Test

AutoProf

AutoProf is a framework designed to test the performance of OneFlow and PyTorch operators. It automatically tests operator performance under different numbers of CPU threads and GPUs and prints a comparison table. It has already been applied to the development of some existing operators and all new operators.

IR

Performance

Graph

Eager

Operators & Tensor

Primitive

  • Lowered the elementwise.cuh template's requirement for pointer alignment.

Improvements

Graph

Eager

Operators & Tensor

Device

Tests

Eager Global Module Tests:

In 0.8.0, all kernels can handle global tensors in distributed settings, and many known SBP-related bugs have been fixed. Global tensors work efficiently and correctly at the kernel level: no matter how the distributed topology changes, the same algorithm logic efficiently produces mathematically consistent results, which greatly reduces the difficulty of verifying correctness in complex, diverse, and asymmetric distributed parallel training.
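
As a minimal sketch, exercising a kernel on a global tensor (two ranks assumed; launched with OneFlow's distributed launcher, e.g. python3 -m oneflow.distributed.launch --nproc_per_node 2 test.py):

import oneflow as flow

placement = flow.placement("cuda", ranks=[0, 1])
# A global tensor split along dim 0 across the two ranks.
x = flow.ones((4, 4), placement=placement, sbp=flow.sbp.split(0))
y = x.abs().sum()  # kernels consume and produce global tensors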

module/functional op PR
abs Oneflow-Inc/oneflow#7540
0_dim_tensor Oneflow-Inc/oneflow#7540
activation Oneflow-Inc/oneflow#7540
adaptive_pool Oneflow-Inc/oneflow#7563
addmm Oneflow-Inc/oneflow#7565
add Oneflow-Inc/oneflow#7204
affine_grid Oneflow-Inc/oneflow#7578
arange Oneflow-Inc/oneflow#7576
argmax Oneflow-Inc/oneflow#7579
argmin Oneflow-Inc/oneflow#7581
argsort Oneflow-Inc/oneflow#7582
argwhere Oneflow-Inc/oneflow#7584
avgpool Oneflow-Inc/oneflow#7585
batch_gather Oneflow-Inc/oneflow#7590
bernoulli Oneflow-Inc/oneflow#7732
bmm Oneflow-Inc/oneflow#7741
broadcast_like Oneflow-Inc/oneflow#7742
cast Oneflow-Inc/oneflow#7773
ceil Oneflow-Inc/oneflow#7744
chunk Oneflow-Inc/oneflow#7750
clamp Oneflow-Inc/oneflow#7752
clip_grad Oneflow-Inc/oneflow#7757
concat Oneflow-Inc/oneflow#7204
conv1d Oneflow-Inc/oneflow#7769
conv2d Oneflow-Inc/oneflow#7771
conv3d Oneflow-Inc/oneflow#7771
cumsum Oneflow-Inc/oneflow#7772
deconv2d Oneflow-Inc/oneflow#7772
diagonal Oneflow-Inc/oneflow#7772
diag Oneflow-Inc/oneflow#7421
div Oneflow-Inc/oneflow#7421
dot Oneflow-Inc/oneflow#7421
dropout Oneflow-Inc/oneflow#7772
empty Oneflow-Inc/oneflow#7508
eq Oneflow-Inc/oneflow#7421
erfc Oneflow-Inc/oneflow#7421
erf Oneflow-Inc/oneflow#7421
expand Oneflow-Inc/oneflow#7772
expm1 Oneflow-Inc/oneflow#7421
eye Oneflow-Inc/oneflow#7421
flatten Oneflow-Inc/oneflow#7421
flip Oneflow-Inc/oneflow#7496
floor Oneflow-Inc/oneflow#7421
fmod Oneflow-Inc/oneflow#7421
fold Oneflow-Inc/oneflow#7772
greater_equal Oneflow-Inc/oneflow#7421
greater Oneflow-Inc/oneflow#7366
fused_bias_add_dropout Oneflow-Inc/oneflow#7867
fused_bias_add_gelu Oneflow-Inc/oneflow#7867
fused_scale_mask_softmax_dropout Oneflow-Inc/oneflow#7867
fused_scale_mask_softmax Oneflow-Inc/oneflow#7867
fused_scale_tril Oneflow-Inc/oneflow#7867
fused_self_attention Oneflow-Inc/oneflow#7867
fused_tril_softmax_mask_scale Oneflow-Inc/oneflow#7867
gather_nd Oneflow-Inc/oneflow#7880
gather Oneflow-Inc/oneflow#7880
glu Oneflow-Inc/oneflow#7880
grid_sample Oneflow-Inc/oneflow#7881
groupnorm Oneflow-Inc/oneflow#7885
masked_fill Oneflow-Inc/oneflow#7457
masked_select Oneflow-Inc/oneflow#7492
math_ops Oneflow-Inc/oneflow#7461
matmul Oneflow-Inc/oneflow#7465
maxpool Oneflow-Inc/oneflow#7683
max Oneflow-Inc/oneflow#7450
mean Oneflow-Inc/oneflow#7650
meshgrid Oneflow-Inc/oneflow#7533
min_max_observer Oneflow-Inc/oneflow#7725
min Oneflow-Inc/oneflow#7450
movedim Oneflow-Inc/oneflow#7679
moving_average_min_max_observer Oneflow-Inc/oneflow#7726
mul Oneflow-Inc/oneflow#7717
narrow Oneflow-Inc/oneflow#7647
negative Oneflow-Inc/oneflow#7644
ne Oneflow-Inc/oneflow#7642
nms Oneflow-Inc/oneflow#7536
nonzero Oneflow-Inc/oneflow#7645
normalize Oneflow-Inc/oneflow#7635
ones_like Oneflow-Inc/oneflow#7635
parital_fc Oneflow-Inc/oneflow#7534
permute Oneflow-Inc/oneflow#7635
prod Oneflow-Inc/oneflow#7635
randint Oneflow-Inc/oneflow#7508
rand Oneflow-Inc/oneflow#7508
reshape Oneflow-Inc/oneflow#7472
roi_align Oneflow-Inc/oneflow#7794
scatter_nd Oneflow-Inc/oneflow#7807
scatter_ops Oneflow-Inc/oneflow#7807
sign Oneflow-Inc/oneflow#7818
slice Oneflow-Inc/oneflow#7818
softplus Oneflow-Inc/oneflow#7818
sparse_softmax_cross_entr Oneflow-Inc/oneflow#7298
split Oneflow-Inc/oneflow#7277
sqrt_square_sum Oneflow-Inc/oneflow#7277
squeeze Oneflow-Inc/oneflow#7289
stack Oneflow-Inc/oneflow#7289
stateful_kernel_with_cache Oneflow-Inc/oneflow#7289
std Oneflow-Inc/oneflow#7303
sub Oneflow-Inc/oneflow#7303
sum Oneflow-Inc/oneflow#7303
tensor_ops Oneflow-Inc/oneflow#7307
tensor_scatter_nd_update Oneflow-Inc/oneflow#7308
tile Oneflow-Inc/oneflow#7322
transpose Oneflow-Inc/oneflow#7332
tril Oneflow-Inc/oneflow#7322
TripletMarginLoss Oneflow-Inc/oneflow#7332
triu Oneflow-Inc/oneflow#7882
unfold Oneflow-Inc/oneflow#7883
unfold_tensor Oneflow-Inc/oneflow#7883
unsqueeze Oneflow-Inc/oneflow#7882
upsample Oneflow-Inc/oneflow#7884
var Oneflow-Inc/oneflow#7891
view Oneflow-Inc/oneflow#7886
weight_norm Oneflow-Inc/oneflow#7886
where Oneflow-Inc/oneflow#7886
zeropad2d Oneflow-Inc/oneflow#7886

EP::Primitive

Completed unit tests for the Primitive log_softmax, softmax, copynd, Memset, Memcpy, matmul, batch_matmul, add, binary, unary, fill, etc. (https://github.com/Oneflow-Inc/oneflow/pull/8132, https://github.com/Oneflow-Inc/oneflow/pull/8139, https://github.com/Oneflow-Inc/oneflow/pull/8137, https://github.com/Oneflow-Inc/oneflow/pull/8109, https://github.com/Oneflow-Inc/oneflow/pull/8143, https://github.com/Oneflow-Inc/oneflow/pull/8108, https://github.com/Oneflow-Inc/oneflow/pull/8154, https://github.com/Oneflow-Inc/oneflow/pull/8118, https://github.com/Oneflow-Inc/oneflow/pull/8291)

Exception

Improved exception and error handling.

Build

CI

Improved the running speed and stability of CI.

Models

Bug fixes

Graph

Eager

Operators & Tensor

Global Tensor

Tensor

Scalar Tensor

Fixed failure of gather to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8376)

0-Size Tensor

Operators

Device

Higher order derivative

Build

CI

Module

Documentation

oneflow - Version 0.7.0

Published by jackalcooper over 2 years ago

OneFlow v0.7.0 Release Notes

OneFlow v0.7.0 came out. Welcome to use it. We would love to hear your feedback!

Chinese version of this article:

https://mp.weixin.qq.com/s/dSR-2Xw92eoFhF0c6MtutQ

Highlights

This release has the following highlights:

  1. Provides a Tensor that can be executed in multi-node multi-GPU scenarios: Global Tensor. It is an easy-to-use solution for distributed execution that makes it easier to implement various distributed parallel strategies and enables more flexible and user-friendly distributed implementations. It supports models including ResNet50, Wide and Deep, GPT, Bert, Swin-Transformer, InsightFace, etc.

  2. Continues to improve nn.Graph. Supports advanced features such as ZeRO, GradAcc, Checkpointing, and Pipelining, and enriches the graph.debug mode. Supports random 2D SBP conversion, semi-automatic derivation of 2D SBP, resuming training from the last checkpoint, etc. Adds OneFlow Feature Stages identifications for each feature of nn.Graph: the basic features are at the Beta Stage, which can meet most user requirements, and the advanced features are at the Alpha Stage, meeting standard requirements.

  3. Deeply optimizes the performance of Eager mode. The Swin-Transformer model runs 3 times faster than on v0.6.0 when tested on a V100.

  4. Operator-related improvements: in the single-node single-GPU scenario, OneFlow's compatibility with PyTorch is further improved. The interfaces, semantics, and results of operators supported by OneFlow are consistent with their PyTorch counterparts, and an automatic testing framework verifies the consistency. With common models, you can accomplish the migration by simply running import oneflow as torch. Compared with v0.6.0, OneFlow adds 16 operators, optimizes the performance of 6 operators, and fixes bugs in 16 operators.

  5. Supports einsum and the view mechanism.

  6. Compiler-related improvements: OneFlow is officially connected to the MLIR ecosystem.

  7. Releases OneFlow-Serving v0.1.0: We provide an out-of-the-box Triton OneFlow backend Docker image.

  8. Releases LiBai v0.1.0, a toolbox for massively distributed parallel training of Transformer. Compared with customized code bases such as Megatron-LM, LiBai provides a series of models and training components for distributed training based on a modular design, aiming to make models trained in distributed mode as convenient as in single-GPU mode.

  9. Releases Flow-Vision v0.1.0: adds DeiT, ConvNeXt, ReXNet, and other models and updates tutorials and documentation.

OneFlow Feature Stages identifications

OneFlow Feature Stages identifies the maturity level of OneFlow features. It provides users with a status description of a feature, indicating its specific level of completeness, API stability, documentation, etc. It also provides OneFlow developers with a standard for feature refinement, which facilitates further improvement.

OneFlow Feature Stages

  • Stable Stage

    • Purpose: release for production use
    • Audience: all users
    • Functionality: same as RC
    • Testing: same as RC
    • Performance: same as RC
    • API: same as RC, with stability within long cycles (e.g., 1 year) and large versions (e.g., 1.0)
    • Documentation: same as RC
  • Release Candidate (RC) Stage

    • Purpose: release for deployment evaluation in production environments
    • Audience: all users, including those who want to deploy production environments
    • Functionality: being able to handle exceptions as well as normal inputs.
    • Testing: end-to-end deployment validated in external environment with good experience
    • Performance: provide evaluation reports and documentation to evaluate performance and scalability in external environments
    • API: API for external user evaluation
    • Documentation: features in this stage are added to the core-feature-set documentation
  • Beta Stage

    • Purpose: release to provide a relatively stable, complete, and available version
    • Audience: all users, especially those with strong feature demands, little concern for unknown trivial issues, and willingness to provide feedback
    • Functionality: complete functionalities addressing the needs of various possible scenarios
    • Testing: complete, covering various corner test cases, and various end-to-end integration tests
    • Performance: performance evaluation and scalability evaluation
    • API: recognized as complete and stable by seed users after full review
    • Documentation: tutorials that describe the usage process
  • Alpha Stage

    • Purpose: release to get early feedback for experimental features
    • Audience: developers and expert users
    • Functionality: core functionality completed
    • Testing: unit testing completed for core requirements of the feature, possibly with unknown bugs
    • Performance: evaluated
    • API: well-defined but not rigorously reviewed, possibly requiring further changes
    • Documentation: API documentation is a must to provide feature definitions
  • Pre-alpha Stage

    • Purpose: release to validate feature prototypes or address urgent needs
    • Audience: feature developers
    • Functionality: limited prototype functionalities
    • Testing: limited testing, possibly with many bugs
    • Performance: unknown
    • API: prone to changes
    • Documentation: possibly none

OneFlow Framework

1. Distribution

Global Tensor

Global Tensor is a newly released set of distributed computing interfaces. It can easily support any parallelism including data parallelism, model parallelism, and pipeline parallelism. Unlike a normal Tensor (hereafter called Local Tensor), Global Tensor is a Tensor with a global view, whose data is distributed in a specific way across a set of devices in a cluster, and each node stores some or all of the Global Tensor's data. Placement and SBP are the basic properties of the Global Tensor that describe the distribution of the data in clusters.

Global Tensor's data distribution

Global Tensor supports three different ways of data distribution, which we collectively refer to as SBP; see the snippet after this list.

  • Split (dim): The data is split equally along the dim dimension and distributed to each device.
  • Broadcast: The data is replicated across all devices.
  • PartialSum: The global data is the element-wise sum of the data on each device.
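
A minimal sketch of constructing a Global Tensor with each SBP type (assuming a two-GPU setup; the placement and values are illustrative):

>>> import oneflow as flow
>>> p = flow.placement("cuda", ranks=[0, 1])
>>> # Split(0): each rank holds a slice along dimension 0
>>> s = flow.tensor([1.0, 2.0], placement=p, sbp=flow.sbp.split(0))
>>> # Broadcast: each rank holds a full copy of the data
>>> b = flow.tensor([1.0, 2.0], placement=p, sbp=flow.sbp.broadcast)
>>> # PartialSum: the global value is the element-wise sum of each rank's local value
>>> ps = flow.tensor([1.0, 2.0], placement=p, sbp=flow.sbp.partial_sum)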

Consistent computational interfaces

Global Tensor has basically the same computational interfaces as Local Tensor. With only small changes, you can convert single-GPU code into distributed code.

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0])
>>> y = x * x

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0],
            placement=flow.placement("cuda", ranks=[0, 1]),
            sbp=flow.sbp.split(0))
>>> y = x * x
# This multiplication is performed on both rank 0 and rank 1

Supporting conversion between Local Tensor and Global Tensor

  • With the Tensor.to_global interface, you can create a Global Tensor from a Local Tensor, regarding it as the Global Tensor's local tensor on the present device.

  • With the Tensor.to_local interface, you can get the local tensor of the Global Tensor on the present device.

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0],
            placement=flow.placement("cuda", ranks=[0, 1]),
            sbp=flow.sbp.split(0))
>>> y = x.to_local()
>>> y.size()
oneflow.Size([1])
>>> y
tensor([1.], device='cuda:0', dtype=oneflow.float32)
# tensor([2.], device='cuda:0', dtype=oneflow.float32) if rank is 1

Supporting redistribution of Global Tensor in clusters

With the Tensor.to_global interface, you can redistribute the data of a Global Tensor in a cluster. The data can be distributed to another set of nodes, and the way it is distributed across these nodes can also be changed (i.e., the SBP can be changed). Redistribution usually incurs inter-process data communication, but the Tensor.to_global interface hides the complicated low-level communication details.

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0], placement=flow.placement("cuda", ranks=[0, 1]), sbp=flow.sbp.split(0))
>>> y = x.to_global(placement=flow.placement("cuda", ranks=[2, 3]), sbp=flow.sbp.broadcast)

Each OneFlow operator defines a set of SBP signatures for its input and output tensors. Global Tensor supports automatic redistribution to provide the SBP signature required by a certain interface, as shown in the code below:

>>> import oneflow as flow
>>> x = flow.randn(4, 4, 
            placement=flow.placement("cuda", ranks=[0, 1]), 
            sbp=flow.sbp.split(0))
>>> y = flow.randn(4, 4, 
            placement=flow.placement("cuda", ranks=[0, 1]), 
            sbp=flow.sbp.split(1))
>>> z = x + y

When x + y is executed, since x is split along dimension 0 while y is split along dimension 1, their local tensors on each device cannot be added directly. Therefore, x's SBP is automatically converted to flow.sbp.split(1), or y's SBP is converted to flow.sbp.split(0), and the SBP of the result z is flow.sbp.split(1) or flow.sbp.split(0) accordingly.

Notes

  • Global Tensor currently doesn't support mixing with the DDP interface.

  • Global Tensor requires all devices to execute simultaneously; code with branches can lead to process deadlock because of divergent execution paths. We will continue to fix this problem.

2. Continued improvement of nn.Graph's features

Overview of the development of nn.Graph v0.7.0

  • Fundamental features enter the Beta Stage, meeting most user requirements;

  • Advanced features enter the Alpha Stage, meeting standard user requirements;

  • ResNet50, Wide and Deep, GPT, Bert, Swin-Transformer, InsightFace, and other models are supported;

Feature of nn.Graph

  • Static and dynamic casting of operators under Static Graph advances from Alpha Stage to Beta Stage

    • Adds unit tests of static execution for all legal operators under nn.Graph; automated unit testing is ready;

    • Supports more flexible inputs and outputs, including List/Tuple/Dict and their nesting, and fixes the problem with Tuple returns of size 1;

    • Adds automatic backward tests;

  • Optimizer and LR Scheduler under Static Graph advance from Alpha Stage to Beta Stage.

    • Adds more built-in LR schedulers, including WarmupLR, CosineAnnealingWarmRestarts, and other common schedulers, and provides SequentialLR and ChainedScheduler for combining schedulers (see the sketch after this list);

    • Refactors the scheduler's get_lr function into a pure-function implementation. This change permits schedulers to be used in combination by computing the lr analytically instead of iteratively;

    • Adds an "is_sparse" parameter to the add_optimizer interface, supporting sparse updates under graph mode. Optimizers that support sparse updates include Adam and SGD, while optimizers under Eager mode don't support sparse updates yet. A subsequent version will support both sparse updates and sparse tensors. The feature is at the Pre-alpha Stage;

    • Adds a debug print feature for LR and Step; you only need to turn on the LR Scheduler's verbose option.

  • state_dict and load_state_dict under Static Graph are newly added, allowing training to resume from the last checkpoint. The feature is at the Beta Stage;

  • Debug under Static Graph advances from Alpha Stage to Beta Stage;

    • Adds debug(2) and debug(3), which help find problems in nn.Module by locating the Python code of operators at the C++ layer and locating forward graph creation and inference for operators;

    • Adds the display of memory overhead

  • ZeRO-DP under Static Graph is newly added, reducing memory overhead related to the Optimizer under data parallelism. The feature is at the Alpha Stage;

  • Global Tensor under Static Graph supports multiple parallel methods, and the feature is between Alpha Stage and Beta Stage;

    • It is utilized in LiBai and other model libraries;

    • It is widely utilized in OneFlow's model libraries, and the coverage of unit test is still ongoing;

    • For 1D Global Tensor, you can define only the input tensor's SBP, and the output tensor's SBP is derived automatically with good results; the feature is at the Beta Stage;

    • For 2D Global Tensor, you can define only the input tensor's SBP, and the output tensor's SBP is derived automatically with good results; the feature is at the Alpha Stage;

    • Conversion from 1D to ND or ND to 1D is newly supported, and the feature is at Alpha Stage;

    • Random conversion of 2D SBP is newly supported, and the feature is at Alpha Stage;

    • Testing of 1D & 2D single operators is still ongoing, and the feature is at the Pre-alpha Stage;

    • Selecting SBP with semi-automatic derivation is supported, and the feature is at Pre-alpha Stage;

  • For Gradient Accumulation under Static Graph, we refactored and repaired support for Reshape and added API documentation. The current interface takes mini-batch input; a future version will offer micro-batch input with a better experience. The feature advances from Pre-alpha to Alpha Stage;

  • For pipeline parallelism under Static Graph, the tutorial has been improved, and pipeline parallelism is available in LiBai and other model libraries. The feature is at the Beta Stage;

  • For automatic mixed precision (AMP) under Static Graph, the API documentation is newly added. The feature advances from Pre-alpha to Alpha Stage;

  • For Activation Checkpointing under Static Graph, the API documentation is newly added. The feature advances from Pre-alpha to Alpha Stage;

  • For Op Fuse optimization under Static Graph, the API documentation is newly added. The feature advances from Pre-alpha to Alpha Stage;

  • For XLA/TensorRT/OpenVINO execution under Static Graph, the API documentation is newly added. The feature advances from Pre-alpha to Alpha Stage;
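
A minimal sketch of combining the new LR schedulers named above (assuming SequentialLR, CosineAnnealingLR, and CosineAnnealingWarmRestarts mirror their PyTorch counterparts' signatures; the step counts and milestones are illustrative):

import oneflow as flow

model = flow.nn.Linear(3, 4)
optimizer = flow.optim.SGD(model.parameters(), lr=0.1)

# Anneal for 10 steps, then hand over to warm restarts.
cosine = flow.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
restarts = flow.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
scheduler = flow.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[cosine, restarts], milestones=[10]
)

for step in range(20):
    # ... forward, backward, optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()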

Tutorials

API Documentation

Tutorials of pipeline parallelism:

Model support under nn.Graph

3. Performance optimization of Eager

  • The performance of Eager is deeply optimized. Running the Swin-Transformer model on V100 GPUs, OneFlow delivers a 25% speedup over PyTorch on a single GPU and a 10% speedup on 8 GPUs;

  • The communication scheduling policy for NCCL in DDP is optimized;

  • DDP supports the optimization of AllReduce fuse, reducing additional overhead generated by fragmented AllReduce, with a 5% performance speedup when it is tested on ResNet50;

  • The VM supports instruction fusion, significantly reducing kernel scheduling overhead;

  • Additional memory overhead is reduced when the CPU load is too high;

  • Eager DataLoader supports the optimization of inter-process memory sharing;

  • The performance of Clip Grad is optimized;

4. Improvements of operators

  • OneFlow is successfully adapted to oneDNN for CPU operator acceleration.

The performance of unary and binary element-wise CPU operators is improved by 4x, and the speed of Swin-Transformer's dataloader is improved by 2.5x. https://github.com/Oneflow-Inc/oneflow/pull/7319

5. Supporting einsum & view mechanism

Adds einsum operators. einsum provides a set of concise but elegant rules that can implement tensor operations including, but not limited to, inner product, outer product, tensor multiplication, tensor transposition, and tensor contraction. Proficient use of einsum lets you easily implement various complex tensor operations and be less error-prone. https://github.com/Oneflow-Inc/oneflow/pull/7526
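
For example, a matrix multiplication and a batched outer product written with flow.einsum (a minimal sketch; the shapes are illustrative):

>>> import oneflow as flow
>>> a = flow.randn(2, 3)
>>> b = flow.randn(3, 4)
>>> # matrix multiplication: contract over the shared index k
>>> c = flow.einsum("ik,kj->ij", a, b)   # shape (2, 4)
>>> x = flow.randn(5, 2)
>>> y = flow.randn(5, 3)
>>> # batched outer product over batch index b
>>> o = flow.einsum("bi,bj->bij", x, y)  # shape (5, 2, 3)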

Adds the view mechanism. The view mechanism allows common operators to reuse/share a Tensor's memory, saving memory by reducing the Kernel Launch/Compute process. At present, new view operators that do not change the tensor.is_contiguous() property have been added, such as reshape, view, squeeze, unsqueeze, etc.: https://github.com/Oneflow-Inc/oneflow/pull/7503 More view operators will be added later (such as transpose, permute, narrow, expand, and unfold).
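
A minimal sketch of the memory-sharing behavior (the tensor is contiguous, so view returns without copying; writes through the view are visible in the original):

>>> import oneflow as flow
>>> x = flow.ones(2, 3)
>>> y = x.view(6)   # no new buffer is allocated; y shares x's memory
>>> y[0] = 5.0      # writing through the view also updates x[0, 0]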

6. Improvements of the compiler

  • OneFlow is officially connected to the MLIR ecosystem, and the OneFlow Dialect component is complete. The round trip between OneFlow Job (the computation graph of OneFlow nn.Graph) and MLIR is successfully completed, and RoundTrip tests are run on all OneFlow operators in the CI process.

  • Implements static graph optimization with a series of automatic fused operators based on MLIR DRR to accelerate OneFlow model training and inference.

7. OneFlow Serving

OneFlow Serving v0.1.0 comes out with the following features:

  • Provides the OneFlow C++ API for inference, supporting model loading and static graph inference.

  • The model weights and the computation graph in MLIR format can be saved simultaneously by running flow.save(graph) in Python. They can be loaded via the C++ API (loading the computation graph is not yet supported in the Python API).

  • Supports automatic inference of OneFlow models using TensorRT and OpenVINO without model conversion (based on the OneFlow XRT module), achieving better acceleration on NVIDIA GPUs and Intel CPUs.

  • Implements Triton OneFlow backend

    • Provides out-of-the-box Docker image.
    • Supports auto configuration: only the model path needs to be given; no Triton configuration file needs to be written.
  • Welcome to use the project deployed with Triton OneFlow backend launched on OneFlow Cloud Platform.

8. LiBai

LiBai is a toolbox for massively distributed parallel training of Transformer. Compared with custom code bases such as Megatron-LM, LiBai provides a series of models and training components for distributed training based on a modular design, aiming to make models trained in distributed mode as convenient as in single-GPU mode. The 0.1.0 version mainly supports the following features and models:

Features:

  • Data Parallelism
  • 1D Tensor Parallelism
  • Pipeline Parallelism
  • Unified Distributed Layers
  • Extensible for new parallelism
  • Mixed Precision Training
  • Activation Checkpointing
  • Gradient Accumulation
  • Gradient Clip
  • ZeRO
  • More flexible "LazyConfig" configuration system
  • Easy-to-use Trainer and Evaluator
  • Data preprocessing supporting images and texts

Models:

  • Bert (3D Parallelism)
  • GPT-2 (3D Parallelism)
  • ViT (3D Parallelism)
  • Swin-Transformer (Data Parallelism)
  • Supports fine-tuning tasks in projects/
  • Supports text classification tasks in projects/

9. flowvision

flowvision 0.1.0 stable version comes out with the following improvements based on the previous version:

  • Adds initialization method trunc_normal_
  • Adds the DeiT model and rebuilds the VisionTransformer model
  • Adds the ConvNeXt model
  • Adds the ReXNet model
  • Adds the PolyLRScheduler and TanhLRScheduler learning rate schedulers
  • Fixes the use of F.normalize in the SSD model
  • Fixes bugs in EfficientNet and Res2Net
  • Fixes the weights problem in the vit_small_patch32_384 and res2net50_48w_2s models
  • Rebuilds the model zoo and runs more complete tests on existing models
  • Rebuilds the load_state_dict_from_url method to automatically save downloaded weights in the cache folder (see the snippet after this list)
  • Improves the Getting Started and flowvision.models documentation
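
A minimal sketch of loading a pretrained model (assuming flowvision keeps the torchvision-style models namespace and a resnet50 entry point with a pretrained flag):

import flowvision.models as models

# Downloads the weights on first use; load_state_dict_from_url caches them locally.
model = models.resnet50(pretrained=True)
model.eval()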

The 0.2.0 version of flowvision is already in progress. A large number of new models will be added based on the 0.1.0 version, and the documentation will be improved, so stay tuned.

oneflow - Version 0.6.0

Published by jackalcooper almost 3 years ago

OneFlow v0.6.0 Release Notes

OneFlow has been open source for 528 days since July 31, 2020. Today OneFlow v0.6.0 came out. Welcome to use OneFlow v0.6.0. We would love to hear your feedback!

This version mainly updates three parts: framework, models, and OneFlow-ONNX. Highlights include:

  • Performance optimization in static graphs, dynamic graphs, operators, memory occupation, etc
  • A larger number of common operators
  • Improvements in static graphs and ConsistentTensor
  • Serving functionality as Nvidia Triton's backend
  • Richer visual pre-training models similar to torchvision and timm
  • Better OneFlow-ONNX conversion functionality

The following are the detailed release notes.

Framework

1. Performance Optimization of nn.Graph

  • Compared to v0.5.0, nn.Graph in v0.6.0 delivers a 10% speedup in training on models such as ResNet AMP and WDL, etc
    • Optimized nn.Graph's performance in high frequency iterative training scenarios
    • Redesigned the scheduling instructions of nn.Graph and refactored the interaction logic between Actor Graph and Eager VM so that the runtime execution of the Graph is asynchronous and parallel to Python input/output Tensor as much as possible

2. Performance Optimization of Eager

  • Compared to v0.5.0, v0.6.0 OneFlow Eager's training speed increases dramatically in small batch scenarios
    • Optimized the scheduling logic for virtual machines
    • Optimized get/set item
    • Optimized tensor.numel()
    • Optimized oneflow.Size()

3. Performance Optimization of Operators

  • Optimized operators that affect the performance of new models, significantly improving the training speed of these models

4. Performance Optimization of Eager's Memory Occupation

  • Optimized some operators' memory occupation during training, enabling the same computing device to run bigger models or larger data
    • Optimized the backward memory occupation of broadcast binary operators
    • Optimized the backward memory occupation of Slice operator
    • Optimized the memory occupation of LayerNorm operator

5. More Useful Features to Static Computation Graph (nn.Graph)

  • The newly added features are related to the efficiency, debugging, completeness, and usability of static graphs
    • To help the debugging of static graphs, we added the following features:
      • debug mode supports graph.debug(1), which prints more information about graph construction
      • Provided the environment variable ONEFLOW_DEBUG_PASS to show the changes in the computed graph before and after compile-time optimization
      • Added user-readable thread naming information to Nsight Profile for locating and retrieving target key thread locations
      • Added many static graph test cases and added automatic nn.Graph tests that accompany Eager tests
    • Provided graph.save() and load() interfaces to support the deployment of models (Serving) using nn.Graph
    • To enable AMP acceleration on GPUs with TensorCore, the environment variable ONEFLOW_ENABLE_NHWC is provided to make CNN-related operators compute in channels-last layout
    • Enabled nn.Graph to support more usage scenarios:
      • Supported the Sparse Update Optimizer for sparse parameter updates in WDL scenarios
      • Supported using the following nn.Module containers with nn.Graph:
        Sequential, ModuleList, ModuleDict, ParameterList, and ParameterDict
      • Supported creating an Optimizer in the init function of nn.Graph
      • Supported multiple parameters sharing the same Tensor with nn.Graph
      • Supported scenarios where the actual number of processes is greater than the number of GPU devices
      • Supported more inplace execution for Consistent SBP inference under nn.Graph

6. A Larger Number of Operators

7. User-Defined autograd.Function

Users can customize autograd.Function just as in PyTorch; see the sketch below.
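
A minimal sketch of a custom Function (assuming the torch-style staticmethod forward/backward and apply interface implied above):

import oneflow as flow

class MySquare(flow.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Save the input for use in the backward pass.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        # d(x^2)/dx = 2x
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output

x = flow.randn(3, requires_grad=True)
y = MySquare.apply(x).sum()
y.backward()  # x.grad now holds 2 * x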

8. Added Basic Serving Functionality

Serving functionality of models is provided by OneFlow as Nvidia Triton's backend.

9. Added Some Functionalities of Tensor (ConsistentTensor)

  • Supported Tensor using 2-D SBP to represent arbitrary hybrid parallelism (such as a Linear operation that runs data parallelism along the row direction of the device matrix and model parallelism along the column direction)
  • Supported Tensor's conversion from arbitrary 1-D SBP to 2-D SBP (the network consists of a mixture of 1-D parallel and 2-D parallel)
  • Supported constructing ConsistentTensor from numpy (see the snippet after this list)
  • oneflow.from_numpy()
  • oneflow.numel()
  • tensor.expand_as()
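
A minimal sketch of the new helpers (single-process usage; the values are illustrative):

>>> import numpy as np
>>> import oneflow as flow
>>> t = flow.from_numpy(np.array([1.0, 2.0, 3.0]))  # build a tensor from a numpy array
>>> n = flow.numel(t)                               # total number of elements: 3
>>> other = flow.zeros(2, 3)
>>> e = flow.ones(1, 3).expand_as(other)            # broadcast to other's shape (2, 3)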

Model

Released flowvision 0.0.54.

1. Richer Visual Pre-training Models

Image Classification

  • CNN series: ResNet, DenseNet, VGG, ResNext, EfficientNet, etc
  • Vision Transformer series: ViT, PVT, Swin-Transformer, etc
  • Vision MLP series: Mlp-Mixer, Res-MLP, g-MLP, etc

Object Detection

  • SSD, SSDLite
  • Faster R-CNN
  • RetinaNet

Image Segmentation

  • FCN
  • DeepLabV3

Style Transfer

  • StyleNet: Supports the styles sketch, candy, mosaic, rain_princess, and udnie

2. Implemented Data Augmentation Operations Similar to torchvision

For data augmentation operations like CenterCrop and ColorJitter that are similar to torchvision's, developers can run import flowvision as torchvision to work in most scenarios.
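
For instance (a minimal sketch; the transform names are assumed to mirror torchvision's):

import flowvision as torchvision

# Compose a torchvision-style preprocessing pipeline with flowvision.
transform = torchvision.transforms.Compose([
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ColorJitter(brightness=0.4),
    torchvision.transforms.ToTensor(),
])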

3. Implemented Advanced Data Augmentation Operations Similar to timm

Advanced data augmentation operations implemented in flowvision.data (see the sketch after this list):

  • Mixup
  • CutMix
  • Random-Erasing
  • AutoAugment
  • RandAugment
  • AugMix
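
A minimal sketch of batch-level augmentation (assuming flowvision.data.Mixup mirrors timm's Mixup signature; the parameter values are illustrative):

import oneflow as flow
from flowvision.data import Mixup

# Mix images and produce soft labels in one call (timm-style; batch size must be even).
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, num_classes=1000)
images = flow.randn(8, 3, 224, 224)
labels = flow.randint(0, 1000, (8,))
images, soft_labels = mixup_fn(images, labels)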

4. Separated the Layers Module and Provided a Plug-and-play Block when Building a Model

flowvision.layers.attention

  • Implemented plug-and-play attention models like Non-Local, SELayer, CBAM, BAM, ECA, etc

flowvision.layers.blocks

  • Provided modules that might be used for model building like PatchEmb, Pooler, ConvBnAct, etc

flowvision.layers.regularization

  • Provided regularization modules such as drop-path, drop-block, and stochastic depth to improve model generalization ability
  • Provided separate files such as activation and weight_init to improve components like activation functions and initialization methods

OneFlow-ONNX Conversion

Updated the OneFlow-to-ONNX conversion toolkit:

  • Supported converting OneFlow models to ONNX models in CPU or GPU mode
  • Added test cases for operators and models to align all classification models in the flowvision library
  • Fixed onnx-runtime bugs during PReLU conversion
  • Compatible with the v1.9.0 onnx-runtime library and later versions
  • Released the v0.5.4 oneflow-onnx package; developers can run pip install oneflow-onnx to try it out
oneflow - v0.5.0

Published by jackalcooper about 3 years ago

Changelog

v0.5.0 (8/10/2021)

Highlights

  • First class support for eager execution. The deprecated APIs are moved to oneflow.compatible.single_client
  • Drop-in replacement of import torch for existing PyTorch projects. You could test it by interchanging import oneflow as torch and import torch as flow (see the snippet after this list).
  • nn.Module for eager execution
  • nn.Graph for lazy execution
  • DDP for data parallel
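
A minimal sketch of the drop-in replacement (only the import line changes; the calls shown are assumed to be within the aligned API surface):

import oneflow as torch  # drop-in alias for an existing PyTorch-style script

x = torch.randn(2, 3)
y = torch.nn.functional.relu(x)
print(y.shape)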

A sneak peek of the new API

Here is a minimal example showcasing how to incorporate an nn.Module into an nn.Graph and run it in lazy mode.

import oneflow as flow

model = flow.nn.Linear(3, 4)  # any nn.Module instance works here

class NeuralGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.model = model # register the nn.Module instance on the graph

    def build(self, x):
        y_pred = self.model(x)
        return y_pred

graph = NeuralGraph() # to create a nn.Graph instance
x = flow.randn(1, 3)
y_pred = graph(x) # to run the created nn.Graph

New in Python API

  • [feature][eager][op][test][python][interface] Add test for convtranspose2d #5239
  • [enhancement][python][interface] Add GroupNorm #5175
  • [enhancement][eager][python][interface] [Add] avgpool1d avgpool3d #5165
  • [feature][eager][op][python][interface] Add deconv cpu impl #5224
  • [bug][eager][api][python][interface] Fix acosh bug #5221
  • [feature][eager][op][python][interface] Dev modules ctc loss #5168
  • [bottleneck][bug][documentation][python][interface] Fix meshgrid test bug #5208
  • [eager][documentation][python][interface] Rename CosineScheduler to CosineAnnealingLR #5112
  • [feature][eager][python][interface] Add meshgrid module #5205
  • [enhancement][feature][bug][op][python] support bias in conv2d's parameter list #5322
  • [eager][documentation][api][python][interface] add not_equal, greater_equal and less_equal module #5350
  • [enhancement][eager][python] refine pow module and its test #5319
  • [enhancement][eager][op][python] Add triu op #5329
  • [enhancement][bug][python] Fix optimizer for not supporting all kinds of iterables #5355
  • [bug][python][interface] raise IndexError in get_canonical_index to support for loop #5345
  • [bug][python][interface] tensor slice assign supports broadcasting #5344
  • [enhancement][op][python] add cpu group conv logic #5314
  • [enhancement][python] Add 'nn.Mish' module and corresponding functions #5310
  • [enhancement][build][python] Remove ONNX from setup py #5297
  • [enhancement][python][interface] [add] zeropad2d #5278
  • [feature][system][python][interface] Lazy nn.Graph FeedInputOpExpr #5458
  • [feature][python][interface] integrate nn.image.flip #5411
  • [bug][python] Fix issues in point of MultiClientSession #5469
  • [enhancement][bug][python] update HasAllMultiClientEnvVars() #5459
  • [enhancement][python] Add in_top_k function #5428
  • [enhancement][python] Dev add docstring #5449
  • [feature][api][python] MultiClientSession #5407
  • [documentation][python] remove --user #5431
  • [feature][python][interface] nn.Graph python #5309
  • [feature][python][interface] Fea/nn graph/graph name #5413
  • [bug][python][interface] rm nn.Graph.train #5424
  • [op][documentation][api][python][interface] add bernoulli module #5353
  • [enhancement][python] flow.S/B/P #5306
  • [enhancement][documentation][python] Add instruction on upgrade pip #5400
  • [enhancement][python] Rm oneflow export and experimental #5589
  • [bug][python] Fix nn.graph.utils module conflict #5598
  • [feature][ci][python] Update autotest framework #5520
  • [enhancement][python] copy of_proto_python_dir to compatible_single_client_python #5539
  • [enhancement][api][python] del default env init #5537
  • [enhancement][python] Fix single client using same glog file #5535
  • [bug][api][python] Fix Session TryClose #5531
  • [enhancement][feature][python] split vector-matrix norm #5478
  • [feature][eager][op][python][interface] Add more upsample kernel #5382
  • [enhancement][feature][test][python] add torchstyle unittest #5489
  • [feature][system][python] nn.Graph with training #5662
  • [enhancement][feature][python] Fea/nn graph/block proxy func #5727
  • [enhancement][api][python] consistent_tensor_to_api #5703
  • [feature][eager][op][python] Dev Align torch avgpool #5610
  • [enhancement][python] fix circular deps of sbp python module #5706
  • [documentation][python] [part5]Remove singleclient outdated api #5674
  • [enhancement][python] [part4]Remove singleclient outdated api #5672
  • [bug][op][python] remove outdated code in conv3d #5696
  • [enhancement][test][python] enlarge tolerance of dataloader test #5689
  • [enhancement][test][python] add autotest for some math ops #5646
  • [feature][python] nn.Graph optimizer part 2: add L2, pass job complete, refactor #5604
  • [enhancement][python] Add clip_grad_norm #5299
  • [purge][python] Remove Single-Client API in oneflow default python #5827
  • [bug][python] Fix ddp grad size #5834
  • [enhancement][feature][python] Dev RMSprop graph conf #5768
  • [enhancement][purge][eager][python] remove scale arg in optimizer #5821
  • [enhancement][feature][python] graph/block io check #5803
  • [enhancement][feature][python] Dev adam graph conf #5709
  • [purge][python] [part10]Remove singleclient outdated api #5756
  • [feature][api][python] better repr of nn.Graph for debug #5762
  • [bug][python] fix weight decay in RMSprop #5755
  • [purge][python] [part9]Remove singleclient outdated api #5752
  • [purge][python] [part8]Remove singleclient outdated api #5750
  • [documentation][python] add first batch of methods in oneflow.nn.functional namespace #5693
  • [purge][python] [part6]Remove singleclient outdated api #5704
  • [bug][python] use default_generator.seed() as random_seed in init #5721
  • [bug][system][python] ddp broadcast params and buffers #5913
  • [enhancement][test][python] Add consistent tensor requires grad test #5925
  • [bug][python] wrap flow.nn.init.* with flow.no_grad() #5932
  • [feature][api][python][interface] add clip_grad to optimizer #5817
  • [enhancement][ci][op][test][python] add randperm with test and docs #5680
  • [feature][api][python] Fea/nn graph/ lr_schedule(and cosine lr_sch) and opt_group #5846
  • [bug][python] fix bug of SyncOnMasterFn atexit #5909
  • [purge][python] Delete single client nn modules #6061
  • [enhancement][python] Move framework.distribute to env #6022
  • [bug][python] skip sync when abnormally exiting #6025
  • [feature][python] Fea/nn graph/warmup amp config #5969
  • [documentation][python] add optimizer api docs #6131
  • [documentation][python] add_tensor_api_doc #6127
  • [bug][python] Fix test_grid_sample.py and test_affine_grid.py threshold #6125
  • [documentation][api][python] add doc of graph #6093
  • [bug][python] Fix make of_format fail in ubuntu #6120
  • [feature][api][python][interface] Fea/graph helpers #6088
  • [enhancement][eager][python][interface] Use flow.randint in dataloader #6086
  • [feature][eager][api][python][interface] Import oneflow as torch #6076
  • [enhancement][test][api][python][refactor] rename OfrecordReader to OFRcordReader #6090
  • [purge][python][need-single-client-tests] Delete single client nn modules #6082
  • [enhancement][python] flow.load tolerates FileNotFound fault #6083
  • [feature][python] Fea/pipeline in graph #6105
  • [enhancement][test][python] graph activation checkpointing #6192
  • [enhancement][feature][op][python] rnn test #6165

New in Ops:

  • [enhancement][op][api][refactor] [Functional] Part2: Add partial unary and math functional apis #5218
  • [enhancement][bug][op][interface] Refine deconv kernel #5229
  • [enhancement][op][api][interface] add ReflectionPad2d #5172
  • [feature][eager][op][api][interface] crossentropyloss and nllloss support ignore_index #5195
  • [feature][eager][op][api][interface] Yejiaojiao/dev bcewithlogitsloss #5173
  • [bug][ci][op] Dev user op set default is_dynamic #5223
  • [enhancement][op] add magic method for pow #5199
  • [enhancement][op][interface] add cpu version of upsampling #5194
  • [enhancement][bug][op][api][interface] add ReplicationPad2d #5148
  • [feature][eager][op][api][interface] add kldivloss module #5155
  • [feature][eager][op][documentation][build][api][interface] Add floor module and the corresponding testcases #4964
  • [enhancement][feature][op] Dev conv1d module #5280
  • [enhancement][op] Add ctc_greedy_decoder op #5294
  • [enhancement][op][system] Dev remove default grad func #5320
  • [enhancement][op][system] Add pad grad func. #5354
  • [enhancement][op][system] Add gradient funcs. #5348
  • [feature][purge][bug][eager][op][interface] fix upsample nearest bug #5347
  • [enhancement][op][system] [Functional] Part7: Migrate pooling ops #5253
  • [enhancement][op] nvjpeg hardware acc #5240
  • [enhancement][feature][ci][eager][op][api][interface] Add bmm module #5334
  • [enhancement][eager][op] Dev image decode eager #5333
  • [enhancement][op] Optimize softmax warp impl #4977
  • [enhancement][eager][op] Dev tensor buffer eager #5317
  • [enhancement][op][api][refactor] [Functional] Part6: Migrate conv op #5252
  • [enhancement][eager][op] Dev sort eager #5284
  • [enhancement][bug][op][api] fix bceloss bug in default weight and reduction #5303
  • [bug][eager][op] remove redundant assert and check #5264
  • [enhancement][bug][ci][op] fix bceloss bug about weight #5269
  • [enhancement][op][api][refactor] [Functional] Part5: Migrate nn ops #5249
  • [enhancement][eager][op] Dev argsort eager #5273
  • [enhancement][op][api][refactor] [Functional] Part4: Migrate array ops #5247
  • [enhancement][op][api][refactor] [Functional] Part3: Migrate binary and activation ops #5246
  • [bug][ci][op][test] Dev fix rmsprop ci fail #5481
  • [enhancement][op] add inplace method: Tensor.sin_ #5471
  • [bug][op] hotfix image_batch_align #5461
  • [enhancement][eager][op][interface] Dev maxpool series op 123d #5244
  • [bug][op] fix pool gpu kernel #5446
  • [feature][eager][op][api][interface] add pixelshufflev2 module #5383
  • [enhancement][feature][ci][eager][op][documentation][api][interface] Add flow xxx and tensor xxx autotest #5386
  • [enhancement][feature][eager][op][api][interface] Modules chunk #5324
  • [enhancement][eager][op] add image normalize for eager #5402
  • [enhancement][eager][op] Dev batch align module #5401
  • [enhancement][eager][op] add coco reader module #5391
  • [enhancement][wip][op] Restruct Elementwise kernel #4130
  • [bug][op] Fix DecodeRandom reuse mem #5606
  • [enhancement][op] Align pytorch maxpool #5525
  • [enhancement][bottleneck][eager][op][api] implementation of constantpad-3d op #5529
  • [enhancement][eager][op] Add scale size for resize #5509
  • [enhancement][op][api][refactor] Dev optimize tensor setitem #5501
  • [enhancement][op] register uint8 dtypeto support dataloader #5499
  • [enhancement][op] Add unique.cuh #5487
  • [enhancement][op][api][interface] Dev ofrecord auto truncating #5412
  • [feature][op][system][interface] Feat: LazyInterpret::ApplyImpl support SourceUserOpExpr and Copy #5711
  • [enhancement][op][interface] Dev logical_and/or modules #5636
  • [enhancement][op] support any number positional arguments for ones and zeros op #5698
  • [enhancement][feature][eager][op] Add conv3d Module #5327
  • [feature][eager][op][api][interface] add batchnorm3d module #5631
  • [bug][eager][op] fix reduce min max backward bug #5651
  • [enhancement][op] Debug dim scatter #5371
  • [enhancement][op][interface] Dev eye #5583
  • [enhancement][eager][op] Dev minimum maximum #5576
  • [enhancement][op] Restruct activation grad op #5669
  • [enhancement][feature][eager][op] Rewrite activation function #5465
  • [bug][op][documentation] add oneflow.cat for documentation #5621
  • [enhancement][op] Lcy logsoftmax #5746
  • [feature][op][need-simple-ci] Feat empty op #5659
  • [enhancement][eager][op] Dev split #5714
  • [enhancement][op][interface] add index_select op #5661
  • [bug][op] fix nvjpeg hw acc #5851
  • [enhancement][op] Remove move in conv_cudnn #5828
  • [enhancement][op][interface] Dev logical_xor module #5694
  • [bug][eager][op] fix squeeze #5808
  • [enhancement][op] Get parallel_id and parallel_num through rank and world size in DDP #5717
  • [bug][eager][op] delete interpolate int type #5805
  • [bug][op] Fix bug in scatter #5743
  • [enhancement][op] Refactor: remove module not required, call function directly #5754
  • [enhancement][op] Remove modules not required(tan, erfc, log1p, scatter_nd) #5791
  • [enhancement][op] Refactor scatter, clamp and pow in cpp instead of in python #5715
  • [enhancement][op] Rm useless code in gather files #5687
  • [enhancement][eager][op] change flip_code to scalar #5786
  • [enhancement][bug][op][interface] fix upsample bug #5753
  • [bug][op][interface] Quick fix Lazy nn.Graph input/output OpConf.BlobConf.is_dynamic #5767
  • [enhancement][bug][eager][op] fix argwhere 0-dim bug #5760
  • [enhancement][eager][op] delete unused code #5744
  • [feature][op] Export fused_scale_tril op #5933
  • [bug][op] Fix backward bug in 3d #5908
  • [bug][op] Fix one_hot api limit #5927
  • [enhancement][eager][op] Dev where scalar #5797
  • [bug][op] fix grad error #5914
  • [feature][bug][op] Fix inplace op circle reference bug #5910
  • [enhancement][op] Move the judgment content to c++, And add scalar fmod #5854
  • [enhancement][op] Support combined_margin_loss op in flow.nn.modules #5830
  • [enhancement][op][api][interface] functional_one_hot #5315
  • [enhancement][op] Dev scalar op #5778
  • [bug][eager][op] fix gather kernel 0 shape #5888
  • [enhancement][op] add l2_normalize for mutl-client interfaces #5859
  • [feature][op] Export function softmax_cross_entropy #6056
  • [enhancement][op] Add int attr for functional adaptive average pool #6059
  • [enhancement][op][interface] dev full op #5955
  • [bug][eager][op] fix 0dim inplace add #6029
  • [feature][op][system][interface] Feat: nn.Graph image gpu decoder #6014
  • [enhancement][op][interface] dev optim_optim_lr_scheduler_multisteplr #5975
  • [enhancement][op] NopKernel #6035
  • [enhancement][eager][op][api] Dev tril op #6005
  • [enhancement][op] dev unfold and fold #5675
  • [enhancement][op] ResNet CUDA Graphs #6018
  • [enhancement][feature][op] add broadcast pow #6013
  • [enhancement][op][interface] init of op diag #5298
  • [op][documentation][api] Fix api document bug #6009
  • [enhancement][op] Dev fused functional #5954
  • [bug][op][build] Add nvcc flag -Werror cross-execution-space-call #6002
  • [bug][op] Fix Normalization grad function #5993
  • [enhancement][feature][eager][op][test][interface] Add fused self attention #5966
  • [enhancement][bug][ci][eager][op][api][interface] Try to fix var bug #5973
  • [enhancement][feature][eager][op][interface] add prod op #5867
  • [enhancement][eager][op][api] add glu op #6065
  • [enhancement][op] Align Torch.nn.functional poolXd #6184
  • [bug][eager][op] fix backward index for gamma beta #6149
  • [bug][op][system] Fix BroadcastMatmulGrad bug #6168
  • [enhancement][op][api] Add Int support for functional.avg/maxpool #6174
  • [bug][eager][op][api][interface] align dropout api name with pytorch #6170
  • [enhancement][op] support inplace operation for hardsigmoid #6137
  • [enhancement][bug][op] Fix do bias correction in Adam/AdamW #5960
  • [bug][eager][op][api][interface] fix repeat 0-dim tensor bug #6150
  • [enhancement][bug][op] Fix select_first_grad bug #6142
  • [bug][ci][eager][op][documentation][interface] Add clipgrad doc and contiguous #6130
  • [bug][op] Fix eager optim dynamic attr bug #6111
  • [enhancement][op] Support grid_sample and affine_grid operator #6038
  • [op][documentation] Export apis for documentation #6068
  • [enhancement][feature][bug][ci][eager][op][documentation][interface] transfer python function to c++ method #6114
  • [op][documentation] Dev functional batch_gather #6233
  • [enhancement][op][test] fix cross_entropy_loss and its test #5799
  • [bug][op] Use attr nd_sbp to check consistent #6222
  • [enhancement][op] Dev fused bn functional #6077
  • [enhancement][op] support default value in intlist #6201
  • [bug][op] fix sparse_softmax get_nd_sbp #6203
  • [bug][op] Fix bug in model fused update #6197
  • [enhancement][op][system][refactor] Optimize tensor getitem. #5433

New in Eager:

  • [enhancement][eager][interface] Reconstruct module files #5251
  • [bug][eager][documentation][interface] Fix conv module bug #5245
  • [bug][ci][eager][interface] Fix bce withlogitloss ci error #5237
  • [feature][eager][api][interface] module BCELoss #5144
  • [enhancement][feature][eager][api][interface] Dev norm op #5178
  • [enhancement][bug][eager] Fix stack module #5222
  • [enhancement][feature][eager] Support different dtype of equal module #5214
  • [enhancement][bug][eager][documentation][api][interface] Add nllloss backward #5210
  • [enhancement][eager][api][upload-core] Decouple FileSystem and IOConf #5162
  • [enhancement][ci][eager] Set lower precision avoid ci failing #5200
  • [eager][documentation] Add hint when apply FunctionNode second time #5369
  • [enhancement][feature][bug][ci][eager][documentation][api] Fix upsample bilinear bug #5366
  • [bug][eager] Fix not contiguous ndarray to tensor bug #5351
  • [enhancement][eager][system] Infer consistent tensor meta #5118
  • [feature][eager] Feat graph autograd engine #5296
  • [enhancement][eager][interface] Dev type as module #5349
  • [feature][eager][documentation][api][interface] Add new ones module #5342
  • [enhancement][bug][eager] Fix logical slice assign dtype #5339
  • [bug][ci][eager][documentation][api][interface] Fix where module bug #5300
  • [bug][ci][eager][documentation][api] Fix l1loss ci error #5307
  • [enhancement][bug][eager][documentation][api][interface] Qi's First Edit of deleting "print" and ".numpy" #5129
  • [feature][eager][refactor] Separate autograd meta to tensor #5267
  • [feature][eager][api][interface] add tile module #5234
  • [enhancement][eager] Release lambda function to reuse tensor memory #5266
  • [feature][bug][eager][documentation] Fix default value not set bug #5483
  • [enhancement][eager][interface] [Add] gather_nd scatter_nd #5422
  • [enhancement][bug][eager] fix param #5473
  • [bug][eager] Fix Tensor.grad setter bug #5462
  • [enhancement][eager] Rename now_grad_arg to current_grad #5466
  • [eager][test][documentation][interface] Add autotest part1 #5436
  • [enhancement][eager] Use functional copy instead of op_builder #5460
  • [bottleneck][bug][eager][interface] fix -1 index not support bug #5448
  • [bug][ci][eager][documentation][api] Fix concat backward bug #5443
  • [enhancement][bug][ci][eager] Add autograd engine warning #5444
  • [feature][eager][api][interface] Smoothl1loss #5256
  • [enhancement][bottleneck][eager] remove device dtype params #5434
  • [bug][ci][eager][documentation][interface] Delete maxpool failed test #5409
  • [enhancement][eager][api] Add tensor grad assginment #5379
  • [enhancement][bug][eager] fix-abs #5398
  • [enhancement][bug][eager][interface] Fix bn track running stats #5393
  • [enhancement][bug][eager] Support uint dtype of constant op #5396
  • [enhancement][bug][eager][documentation][interface] Delete useless code upsample #5392
  • [enhancement][ci][eager][interface] add flow.view #5301
  • [enhancement][bug][ci][eager][api][interface] Add masked select module #5356
  • [bug][eager][interface] Fix batchnorm backward bug #5602
  • [enhancement][eager] Support weight_dacay(l2 actually) #5587
  • [feature][eager][documentation][api] Add new autotest #5588
  • [enhancement][eager][documentation][api] Dev fmod #5404
  • [feature][eager] Support inplace add #5432
  • [feature][eager][interface] Feat tensor stride property #5543
  • [enhancement][feature][eager][documentation][api] Add flip module #5541
  • [feature][eager] Feat module repr #5486
  • [enhancement][bottleneck][bug][eager][interface] Fix maxpool1d params #5493
  • [enhancement][feature][eager][interface] Dev flow.utils.data part1 #5406
  • [bug][eager][api] Fix tensor getitem bug #5474
  • [enhancement][eager][need-simple-ci] export datasets interface #5691
  • [enhancement][eager][system] rebase #5601
  • [enhancement][eager][test] added nn.RecordBytesDecoder with its test #5475
  • [enhancement][feature][eager][need-simple-ci] 0-dim tensor support #5552
  • [enhancement][bug][eager] rewrite slice_update backward #5677
  • [enhancement][bug][eager][interface] align view input style with torch #5676
  • [enhancement][eager][interface][need-simple-ci] add autotests for modules #5666
  • [enhancement][bottleneck][eager][interface] Dev constantpad1d op #5579
  • [enhancement][eager][api][interface] Restruct MathOps AutoTest #5654
  • [enhancement][bug][ci][eager] Fix flip bug #5657
  • [bug][eager][api][interface] Fix expand module bug #5650
  • [enhancement][bug][eager][documentation][api] Fix repeat bug #5633
  • [enhancement][eager][test][api][interface] Add new autotest #5617
  • [enhancement][eager][api][interface] Dev flow.utils.data part2 #5500
  • [enhancement][bug][eager] make setitem device match #5835
  • [bug][eager][api][interface] align reshape input param with pytorch #5804
  • [feature][bug][eager][api] Align where op with torch #5850
  • [enhancement][bug][eager][api] Restruct prelu op #5829
  • [bug][eager][need-simple-ci] fix pooling ceil_mode bug #5818
  • [enhancement][eager] stateful local kernel supports consistent #5789
  • [bug][eager][api][interface] Fix argwhere bug #5816
  • [enhancement][eager][documentation][api] dev-nonzero #5809
  • [enhancement][feature][eager][api] Add fake quantize op #5690
  • [enhancement][bug][eager][documentation][api] Add api #5663
  • [enhancement][eager] Refactor consistent infer result #5790
  • [bug][eager][need-simple-ci] skip dataloader test #5780
  • [bug][eager][need-simple-ci] fix 0-dim tensor.fill_ #5771
  • [enhancement][eager] Cpu mpi broadcast #5726
  • [feature][eager] Feat grad mode classes #5956
  • [enhancement][bug][eager] fix wrong names #5951
  • [enhancement][eager][system] Local dep object pool #5953
  • [enhancement][eager][interface] rename OpExprInterpState to AutoGradCaptureState #5918
  • [bug][eager] Fix linear bug #5945
  • [bug][eager] Fix tensor_meta update bug #5924
  • [enhancement][eager] use flow.randperm #5928
  • [enhancement][eager] consistent init/save/load #5896
  • [enhancement][bug][eager][documentation][interface] Restruct sort and argsort op #5911
  • [enhancement][bug][eager][interface] Try to fix the problem that the insightface cannot converge. #5906
  • [enhancement][bug][eager][interface] Add autotest #5899
  • [enhancement][eager] The scheduler thread joins worker threads #5893
  • [enhancement][eager] Bugfix async callback #5881
  • [feature][eager] Feat tensor to bool #5836
  • [bug][eager] Remove inplace broadcast_add #5551
  • [enhancement][eager] Broadcast consistent shape and dtype #5784
  • [enhancement][eager] Fix optimizer list parameters input bug #5848
  • [enhancement][eager][interface] Dev flow.utils.data part3 #5644
  • [enhancement][eager][api] Normalize naming of modules #6066
  • [enhancement][feature][eager][api][interface] add truncnormal #6051
  • [enhancement][bug][eager] AutoMatedTest support test module.parameter.grad #6043
  • [enhancement][feature][bug][eager] add module call kwags #6069
  • [enhancement][eager][api][interface] add tensor.item tensor.tolist #6021
  • [enhancement][eager][api][interface] Export pool ops api #6047
  • [enhancement][bug][eager][test][documentation][interface] Add more autotest sample #6039
  • [enhancement][bug][eager][system] disable cuda_h2d stream #6020
  • [feature][eager][test][api][interface] Add autotest codegen #6019
  • [feature][eager][documentation] Refactor cosine lr scheduler #6000
  • [enhancement][eager][interface] tensor.cpu/tensor.cuda #5894
  • [enhancement][eager][api] Support consistent_tensor.to(dtype) #5991
  • [bug][eager][interface] remove redundant codes in ModuleDict #5961
  • [bug][eager] Fix LayerNorm check bug #6196
  • [enhancement][eager][api] Change dropout api #6182
  • [enhancement][good for pr][eager][api][interface] add: test convert dependency #6023
  • [enhancement][bug][eager][interface] Fix autotest codegen bug #6171
  • [bug][eager] restore instr_local_dep_object_pool_size for nccl #6160
  • [enhancement][eager][api][interface] Aligin pooling op functional api names with torch #6163
  • [feature][bug][eager][api][interface] delete file #6162
  • [bug][eager] Fix optim load_state_dict bug #6152
  • [enhancement][eager][api] add is_training to dropout functor #6148
  • [enhancement][eager] Decompose nd sbp boxing #5800
  • [enhancement][eager] support consistent_tensor.to(copy=True) #6122
  • [feature][eager] Static grad scaler #6135
  • [bug][eager] Fix LayerNorm expr bug #6121
  • [bug][eager][api] move numpy c api init in numpy.cpp, make np array contiguous before copying #6117
  • [enhancement][eager][refactor] Remove params from ParamGroup getitem #6096
  • [enhancement][feature][eager] Support tensor and optimizer serialization #6087
  • [enhancement][bug][eager] fix bug about tensor str in nonsymmetric cast and getitem in consist… #6239
  • [enhancement][eager] Cpu all reduce #5849
  • [feature][eager] Support assign copy interface #6228
  • [enhancement][eager][api][interface] Dev reconstruct pad ops #6223
  • [enhancement][eager][api][interface] support flow.cuda.is_available #6124
  • [bug][eager] make flow._C.local_all_reduce sync lanuched #6175
  • [enhancement][eager] Rename flow to oneflow in user hint #6190
  • [bug][eager][tooling][test][api][interface] Autotest generate input tensor #6206
  • [enhancement][eager] consistent tensor zeros_() #6202
  • [enhancement][eager] Cpu mpi #5865

Build enhancements:

  • [bug][build] Fix GRPC compilation failure on CMake 3.20 #5255
  • [bug][build] Refine header file copy #5254
  • [bug][build] Fix older version CMake doesn't support multiple targets in CLI #5248
  • [bug][build] Turn off NCCL_STATIC/CUDNN_STATIC when CUDA_STATIC is OFF #5243
  • [feature][build] Fix support for Ninja and add Ninja build in Simple CI #5236
  • [enhancement][build] Add cmake option CUDA_STATIC #5164
  • [bug][build] Fix protobuf debug postfix #5233
  • [enhancement][ci][build] Move default third party dir into build dir #5230
  • [enhancement][build] Refine protobuf cmake #5216
  • [enhancement][ci][build] Remove transport test main #5215
  • [enhancement][ci][build] Speedup opencv build #5213
  • [enhancement][build] Support clang #5015
  • [enhancement][documentation][build] Add prefix when creating git archive #5201
  • [enhancement][build] Add cmake option NCCL_STATIC #5160
  • [enhancement][build] Refine CMake CUDA version handling #5192
  • [enhancement][build] Use clang plugin to check Maybe variables are used #5358
  • [enhancement][build] Add BUILD_BYPRODUCTS for ExternalProject_Add #5316
  • [enhancement][build] Add cmake init cache to simplify user onboarding #5311
  • [feature][bug][build] Fix macOS support and run macOS build in Simple CI #4947
  • [enhancement][build] flatbuffers use mirror #5295
  • [enhancement][build] Don't build test by default #5302
  • [enhancement][build] Prevent building from scratch when toggle flag BUILD_GIT_VERSION #5259
  • [enhancement][build] Refine gRPC, glog, gflags cmake for conda #5276
  • [feature][build] Support XLA with CPU-only #5260
  • [enhancement][ci][onnx][build] Remove ONNX from CI #5257
  • [enhancement][build] Refactor build_wheel to support oneflowinc images #5427
  • [enhancement][build] Add arg skip_audit in build wheel #5423
  • [bug][build] hwloc disable shared #5388
  • [documentation][build] Update readme for autoconf and libtool #5376
  • [enhancement][build] remove dir python and compatible_single_client_python #5609
  • [bug][build][system] Fix pyyaml version #5594
  • [enhancement][ci][build] force release flags #5574
  • [bug][build] prevent endless loop #5534
  • [enhancement][build] Support sccache #5528
  • [enhancement][build] Add definition for CMAKE_BUILD_TYPE and print cmake_build_type in oneflow doctor #5505
  • [enhancement][ci][build][need-simple-ci] Fix macOS for recent changes #5705
  • [bug][build] fix return type error on gcc 4.8.5 #5660
  • [enhancement][build] Check CMAKE_BUILD_TYPE #5656
  • [enhancement][build] add -Werror=return-type #5655
  • [enhancement][build] Clean and fix for new py dir #5618
  • [enhancement][build] cmake: disable array-bounds check & treat warnings as errors for pyextobj and oneflow_internal & fix warnings #5838
  • [bug][build] set CMAKE_BUILD_TYPE to Release if undefined #5842
  • [enhancement][build][need-simple-ci] Fix all warnings & Add option TREAT_WARING_AS_ERROR to cmake #5751
  • [enhancement][build] add CMAKE_INTERPROCEDURAL_OPTIMIZATION in fast cmake cache #5970
  • [enhancement][build] add clang tidy target #5957
  • [bug][build] cmake: fix cmake cache args in opencv #5959
  • [enhancement][build] Add cmake option USE_SYSTEM_NCCL #5897
  • [enhancement][build] cmake: include third party headers as system headers to avoid warnings #5879
  • [enhancement][build] Ignore opencv-python on machine aarch64 #5884
  • [enhancement][build] enable CMake first class cuda support #5858
  • [bug][build] Fix compile warning (strict-aliasing) #5872
  • [enhancement][bug][build][need-simple-ci] Upgrade gtest and fix some errors raised by clang #6079
  • [bug][ci][build] cmake: fix ninja build in CI #6072
  • [bug][build] fix files not actually removed when building for multiple python versions #6060
  • [bug][build][api] functional_api: fix build error in mac os #6010
  • [bug][build][need-simple-ci][need-single-client-tests] Fix recompile from scratch #6036
  • [bug][build] Turn on NVCC's warnings #6011
  • [bug][build][need-single-client-tests] fix bundle .so of other python version #6034
  • [bug][ci][build][need-single-client-tests] use copy_all_files_in_dir to replace copy_files #6033
  • [enhancement][build] check compiler version in cmake #6026
  • [enhancement][build] Add CUDA_NVCC_THREADS_NUMBER #6017
  • [enhancement][build][need-simple-ci] optimize of_include_copy #5978
  • [enhancement][ci][build][need-single-client-tests] CI: remove -DTREAT_WARNINGS_AS_ERRORS=OFF #6008
  • [enhancement][build][xla] xrt: fix all warnings #5915
  • [enhancement][build] Prevent opencv compile failure with std 17 #5997
  • [enhancement][build] Use bundled cub #5998
  • [enhancement][ci][build] update clang tidy diff warnings-as-errors option #5989
  • [enhancement][build] Update run_clang_tidy.py to set return code and add warning-as-errors #5977
  • [enhancement][build] check: fix clang-tidy-diff commands #5972
  • [bug][build] Suppress NVCC warning #177-D #6094

XLA enhancements:

  • [bug][xla] Make the blob header memory aligned. #5286

System:

  • [enhancement][system] Refactor Memory Zone #5072
  • [enhancement][system] Add interface InferContext::OutputTensorDesc #5219
  • [bug][system] Lazy construct functor to make sure that the operators has already been registered. #5225
  • [enhancement][system] Refactor infer ctx output isdynamic #5220
  • [enhancement][system] Refactor infer ctx input isdynamic #5211
  • [enhancement][system] Wake up the heartbeat thread immediately #5081
  • [enhancement][system] Fix xla test case fail #5203
  • [enhancement][system] Add interface InferContext::InputDType #5153
  • [purge][system] delete const_cast in Output #5196
  • [feature][system] Add hwloc for topology detection #5291
  • [enhancement][system] fix registry may segment #5336
  • [enhancement][system] Use functional api instead of op_expr_helper::XXXOp. #5364
  • [enhancement][system] move btob to op #5274
  • [documentation][system] Add Latest News section in README #5361
  • [enhancement][bug][system] fix dropout module: return directly if not training #5346
  • [bug][system] add missing JUST #5357
  • [documentation][system] Add more communication outlets on README #5359
  • [enhancement][feature][system] CommNet dynamic register memory #5281
  • [enhancement][system] Use symbol device #5341
  • [enhancement][system] fix multithread bug in env #5283
  • [bug][system][api] fix bug in cfg_replacement #5335
  • [bug][system] Fix create log directory thread-unsafe #5326
  • [bug][system] fix_bug_in_make_parallel #5328
  • [enhancement][system][cfg] replace train_conf, job_conf using cfg::xx #5263
  • [enhancement][system][quantization] support tensorrt in qat #5287
  • [enhancement][system][api] Export functional apis for oneflow.experimental. #5313
  • [enhancement][system] fix bug check between cfg enum and proto enum #5285
  • [enhancement][system] replace CHECK_EQ using CHECK_EQ_OR_RETURN #5279
  • [enhancement][system] Refactor SbpXXX to cfg::SbpXXX #5120
  • [enhancement][system][api] add detach for LazyMirroredtensorImpl #5270
  • [enhancement][system] shorten XXIsDynamic4ArgNameAndIndex to be xxIsDynamic #5265
  • [enhancement][system][cfg] job_config to cfg #5235
  • [feature][system] Multi-Client LogicalRun degenerate to PhysicalRun #5479
  • [enhancement][system] fix ConstructOp without JUST #5480
  • [enhancement][system] Output arg modifier return maybe part 1 #5451
  • [feature][system][interface] Fea/nn graph/graph build ctx #5420
  • [enhancement][system] Throw exception if check failed #5457
  • [feature][system] multi client launch #5372
  • [enhancement][system][api] Optimize reduce mean #5452
  • [enhancement][system] export Tensor only to python #5440
  • [enhancement][system] Output arg modifier return maybe part_0 #5447
  • [enhancement][system] ThreadMgr support AddPlan #5450
  • [enhancement][system] Refactor infer ctx input tensordesc #5226
  • [enhancement][system][api] instruction builder return maybe #5442
  • [feature][system][interface] MultiClientSessionContext #5421
  • [enhancement][feature][system] add launcher, update multi client launch and exit #5414
  • [purge][system][refactor] Remove IOConf #5419
  • [enhancement][system] Dev refine generator #5426
  • [enhancement][system] Support inplace operations #5204
  • [enhancement][system][refactor] Dev refactor generator #5397
  • [enhancement][system] Add new placement init func #5408
  • [enhancement][system] NNGraphIf #5387
  • [enhancement][system][refactor] Cast explicitly in unpack call to avoid conflict with Optional. #5380
  • [enhancement][system][interface] [Random Generator] Part2: Migrate functional dropout #5378
  • [enhancement][system] replace ForeignJobInstance using JobInstance #5374
  • [enhancement][system][refactor] Speedup reshape module by 5x. #5381
  • [feature][system][interface] [Random Generator] Part1: Dev random generator #5360
  • [enhancement][system] Add ONEFLOW_STREAM_CUDA_EVENT_FLAG_BLOCKING_SYNC #5612
  • [enhancement][system] [part2]Remove singleclient outdated api #5568
  • [feature][system][interface] nn.Graph call and launch impl #5580
  • [enhancement][system] remove outdated doctest api and "@experimental_api" #5564
  • [feature][system][interface] Register ForeignCallback and Watcher in Multi-Client #5591
  • [enhancement][system] [Part-1]remove outdated api and files of multi-client on master branch #5556
  • [feature][system][interface] LazyInterpret build LocalTensor if input is local #5582
  • [enhancement][system] add job_pass MultiClientAutoSourceAndSinkTick #5507
  • [feature][system] Fea/nn graph/optimizer #5533
  • [feature][system][interface] New/CloseRuntimeBuffers and RunLazyJob impl #5571
  • [feature][system][refactor][interface] NNGraph interface and implement for CompileAndRuntime #5558
  • [feature][system] Fea/nn graph/forward graph #5516
  • [enhancement][system] Lazy job stream type #5389
  • [enhancement][system] Refactor single client autotick #5506
  • [enhancement][system] replace underscore using dot in single client #5547
  • [bug][system] fix return type #5548
  • [feature][system][interface] LazyInterpret for UserOpExpr #5544
  • [enhancement][system] Add ProfilerStart/ProfilerStop API #5542
  • [feature][system][interface] LazyInterpreter for FetchOutputOpExpr and set op parallel_distribution #5527
  • [enhancement][system] Multi client push pull #5492
  • [enhancement][system] registry_callback_fn return maybe #5456
  • [enhancement][system] bw_gen_fn return maybe #5455
  • [enhancement][system] gen_bw_fn return maybe #5454
  • [enhancement][system] Compatible single client #5417
  • [feature][system][interface] GlobalMultiClientEnv and refine EagerExecution #5523
  • [enhancement][system] Job pass maybe system #5503
  • [enhancement][system] Remove Plan::net_topo #5502
  • [feature][system][interface] LazyInterpret for FeedVariableOpExpr #5490
  • [enhancement][system] Input arg modifier return maybe #5453
  • [feature][system][interface] Fea/nn graph/block scope #5498
  • [feature][system] jit_fuse_cast_scale #5332
  • [enhancement][system] Remove obsolete Profiler #5747
  • [enhancement][system][api] Dev fix batch norm not stats #5733
  • [enhancement][system] rename rpc_token to TransportToken #5735
  • [enhancement][system][api] Refactor maximum minimum py2cpp #5724
  • [enhancement][system] Replace piece_id with comm_net_sequence_number #5731
  • [enhancement][system] beautify stack frame #5686
  • [enhancement][system] Add env ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER #5728
  • [enhancement][system] Add env ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE #5720
  • [enhancement][system][api][refactor] Refactor functional sub, mul and div apis #5713
  • [feature][system] ddp #5008
  • [enhancement][system][api][refactor] Refactor functional matmul and add apis. #5697
  • [bug][system] Fix ClearKV("plan") #5710
  • [enhancement][system] Rename cpu to async cpu #5712
  • [enhancement][system] Support tensor.to()/to_local() #5271
  • [feature][system][refactor][interface] Multi-Runtime for multi nn.Graph #5683
  • [bug][system][refactor] Add tag for Optional inplace constructor #5619
  • [enhancement][system] Move Global to env scope #5670
  • [enhancement][system] add JUST wrapper #5681
  • [enhancement][system] New sync consistent meta info #5634
  • [enhancement][system][refactor][interface] Refactor RuntimeCtx for multi-runtime #5664
  • [feature][system][interface] Feat: memory shared between EagerTensor with VariableRegst #5649
  • [enhancement][system] Use functional call directly instead of construct a module and then call-Add #5613
  • [enhancement][system] disable eager_op consistent mode #5647
  • [enhancement][system] add msg_penddin_list in ibverbs_qp to optimize qp_init_attr.cap.max_send_wr #5485
  • [enhancement][system] IBVerbsCommNet add knobs #5626
  • [enhancement][system] Prune python tensor #5596
  • [feature][system][interface] Feat: LazyInterpret infer op / tensor ParallelDescScope #5625
  • [enhancement][system] Replace src tick with wait and send ids #5603
  • [enhancement][system] Support symbol placement type in functional. #5627
  • [enhancement][system][api][refactor][interface] Dev advanced indexing #5559
  • [enhancement][system] Optimize maybe. #5839
  • [enhancement][system] Decorator 4 disable recursive boxing call #5796
  • [enhancement][system] add_eager_boxing_and_op_interpreter_dispatch_error_info #5819
  • [enhancement][system] Kernel CUDA Graphs Support #5725
  • [bug][system] Fix placement print bug #5853
  • [bug][system] when error msg formatting fails, return error->DebugString #5844
  • [enhancement][system][refactor] Rename variables named *parallel_distribution* to *nd_sbp* (1) #5815
  • [feature][system][interface] Support Free EagerTensor caught in nn.Graph build #5777
  • [enhancement][system] Reuse CUDA event / Refine BnInOp2Blob / Refine channel #5837
  • [enhancement][system][serving] fix bug in AddInputOutputOpsPass: check existence of key in HashMap(inferface_lbi2scope_sym_id) #5653
  • [enhancement][system][api] unpack_call: impl new unpack_call_dispatcher for better performance #5820
  • [feature][system] Feat consistent tensor python constructor #5812
  • [feature][system] Support 0shape tensor #5620
  • [documentation][system] fix launcher description #5770
  • [feature][system][interface] Multi-nn.Graph memory reuse by Chunk manager #5658
  • [bug][system] Fix naive b2p error #5806
  • [enhancement][system] set created generator with default rng seed #5801
  • [enhancement][system] enhance_local_to_consistent #5761
  • [feature][system] add flow.randn #5736
  • [enhancement][system] Refactor hierarchical parallel cast autograd #5764
  • [enhancement][system] Collective boxing executor add_plan delete_plan #5495
  • [enhancement][system] Fix throw abort #5795
  • [enhancement][system] DECORATE #5794
  • [enhancement][system] Interface eager boxing #5682
  • [enhancement][system] extract_consistent_to_consistent_op_expr #5870
  • [enhancement][system] disable backward pass consistent tensor meta check. #5871
  • [enhancement][system] Add CudaStreamIndexGenerator::GenerateNamedStreamIndex #5940
  • [bug][system] Only query PCI bus id when CUDA version >= 11 #5937
  • [enhancement][system] maybe: add JUST_MSG and CHECK_JUST_MSG #5904
  • [bug][system] Fix bug scalar #5950
  • [enhancement][system] framework: fix rvalue reference warnings #5948
  • [purge][system] Remove CudaWorkType #5942
  • [enhancement][system] refactor_symbol #5941
  • [bug][system] consistent_tensor_infer_cache: fix memory leak #5938
  • [feature][system] support to print gpu #5936
  • [enhancement][system] Bugfix static check #5935
  • [bug][system] fix nccl_version log #5934
  • [bug][system] Fix bug of multi-GPU train nn.Graph extra mem cost in rank 0 #5930
  • [enhancement][system] Only gradient acc be scheduled in parallel. #5926
  • [enhancement][bug][system] fix_ddp_bug_on_8_process #5929
  • [enhancement][system] Fix bug error msg format #5866
  • [feature][system] print consistent tensor data #5902
  • [bug][system] Move parse env to the constructor #5922
  • [enhancement][system] Remove GlobalWorkStreamId/GlobalThrdId #5917
  • [bug][system] shared_or_scalar: fix alias warnings #5916
  • [purge][system] Remove CompActor #5919
  • [enhancement][system] Use symbol dtype #5641
  • [enhancement][feature][system] Control Graph / Session / Env's python c++ object destruction #5845
  • [enhancement][bug][system] Sync access and assign indexing tensor. #5907
  • [enhancement][system][api][refactor] Dev consistent arange #5883
  • [enhancement][system] Lazy interpreter for new ConsistentToConsistentOpExpr #5903
  • [bug][system] Fix BUG of LazyInterpret FreeEagerTensor memory shared with regst #5891
  • [bug][system] fix typo in raise RuntimeError #5890
  • [enhancement][system][refactor] Rename the ParallelDistribution class to NdSbp #5814
  • [feature][system] add flow.rand #5722
  • [feature][system] Lazy Interpret support infer default device cpu #5880
  • [enhancement][system] Tensor str #5783
  • [feature][system][interface] Lazy to_consistent #5774
  • [enhancement][system] wait vm empty before exiting #5860
  • [enhancement][system] Eager boxing n to 1 #5949
  • [enhancement][system] add kernel observer #6052
  • [enhancement][ci][system] Optimize ddp broadcast and add speed/memory test in ci #6044
  • [enhancement][system] add var to control printing warning only once when blocked #6045
  • [enhancement][system][refactor] Rewrite pow and logical functional apis #6032
  • [enhancement][system] Token seq id #5964
  • [enhancement][documentation][system] Remove python function wrapper. #6012
  • [feature][system] Add timeout and loc for blocking calls #6007
  • [enhancement][system] Eager boxing 1 to n #5943
  • [enhancement][system] Boxing expr #6015
  • [enhancement][system] new_X_to_B #5987
  • [enhancement][system] Add unimplemented return information #5952
  • [enhancement][system] Revert "Faster decorator" #6006
  • [enhancement][system] Throw exception if using advanced indexing for tensor setitem #6001
  • [enhancement][system] Support eager boxing sm 2 sn #5869
  • [enhancement][system] Move framework/local_dep_object.* to the eager directory #5988
  • [enhancement][system] Fix builtin op arg tuple. #5464
  • [feature][system][refactor] Dev functional multiple signatures #5982
  • [enhancement][system] Faster decorator #5996
  • [enhancement][system] Placed nd sbp #5995
  • [feature][system] Support asymmetric input/output/variable tensors in nn.Graph #5983
  • [enhancement][system] LightActor #5868
  • [bug][system] Prevent running oneflow in forked subprocess #5976
  • [bug][system] common/error: fix build error in macOS #5971
  • [bug][system] fix_bug_test_tensor_str #5958
  • [enhancement][system] Refine StreamContext #6191
  • [enhancement][system] container_util: fix VectorAt, remove useless MutMapAt #6172
  • [enhancement][system] Typesafe KernelState #6198
  • [enhancement][system] Primitive based copy task node #6195
  • [feature][system][interface] Lazy support Scalar #6181
  • [enhancement][system] Disable implicit boxing when parallel num eq one #6188
  • [enhancement][system] Primitive #6183
  • [enhancement][system] Remove IDMgr::GetGpuPhyIdFromThrdId/IDMgr::GetDeviceTypeFromThrdId #6169
  • [enhancement][system] remove op_expr_helper inside gradient_funcs #6057
  • [feature][system][api] Add tensor yaml, support export tensor functional api. #6099
  • [feature][system] Plan memory log #6151
  • [feature][system] Add dtype bfloat16 #5304
  • [enhancement][system] StreamContext #6129
  • [bug][system] Fix wrong inplace acc grad #6146
  • [enhancement][system] UserKernel remove job_desc #6144
  • [enhancement][system][api] Fea/graph/add outputs buffer to enable pipeline #6126
  • [enhancement][system] not fuse request for nccl 2.10.3 #6136
  • [bug][system] NewUniqueId thread safe #6141
  • [enhancement][system] XRT remove job_desc #6139
  • [enhancement][system] SystemOpFillJobNamePass #6138
  • [enhancement][system] mv_boxing_folder_to_core #6140
  • [enhancement][system] Refactor boxing interpreter to boxing expr #6134
  • [enhancement][system] Eager boxing one to one #6048
  • [enhancement][system] Vm cpu efficiency #6110
  • [enhancement][system] Naive generic boxing #6116
  • [feature][system] send/recv #5992
  • [enhancement][system] disable_print_stack_in_tensor_numpy #6123
  • [feature][system] add all_reduce by to_consistent #5963
  • [enhancement][system] KernelContext #6084
  • [enhancement][bug][system] Fix sync nccl and async nccl deadlock #6071
  • [bug][system][refactor] Refactor to local #6098
  • [enhancement][system] Replace xor with hash combine (part 1) #6078
  • [enhancement][system] Optimize error message #6073
  • [enhancement][system] Rename Error::xx to Error::xxError #6049
  • [enhancement][system] send formatted msg to glog #5999
  • [feature][bottleneck][bug][system][interface] [Feat.] NNGraph new eager tensor for new variable created in JobPass #6091
  • [bug][system] Fix bug of multi-GPU eager copy D2H extra mem cost in rank 0 #6092
  • [enhancement][system][api] Rename module flow.F to flow._C #6053
  • [feature][system][interface] [Feat.] Eager consistent OFRecordReader #6089
  • [enhancement][system][api] Dev fix and align interface #6075
  • [feature][bottleneck][bug][system][interface] NNGraph input/output valid by register tensors #6240
  • [bug][system][interface] Fix bug of Multi-Client src tick output order #6221
  • [enhancement][bug][system] Add cast primitive #6234
  • [feature][bottleneck][system][interface] Auto FixPipelineStageIdPass #6204
  • [enhancement][system] move scalar to oneflow namespace. #6235
  • [enhancement][system] UserKernel init CUDA Graphs with state #6230
  • [feature][system] Comm broadcast #6213
  • [enhancement][system][refactor] Rename opname to optype_name in AutogradEngine #6154
  • [enhancement][system] Add memset primitive #6218
  • [enhancement][system] Add StreamContext::device_type()/DeviceCtx::device_type() #6217
  • [feature][system] add all_gather and fix bug of multi rank doctest #6189
  • [feature][system][interface] [Feat.] Lazy interpreter skip hierarchical_parallel_cast #6208
  • [purge][system] Cleanup KernelUtil #6212
  • [enhancement][system] StreamContextAdapter #6205
  • [enhancement][system] Dev eliminate gcc warnings #6199
  • [feature][bottleneck][system][interface] [Feat.] nn.Graph support grad acc with input/output tensor #6155
  • [enhancement][system] Cpu symmetric s to s #6153
  • [enhancement][system][upload-core] Op expr infer tensor meta #5064
  • [enhancement][system] Infer consistent tensor meta #5362

CI enhancements:

  • [bug][ci][api][interface] Refine module test #5232
  • [enhancement][ci] Add Simple CI, runs CPU-only on GitHub hosted servers #5207
  • [enhancement][ci] Run exe test in CPU-only #5202
  • [enhancement][ci] Cancel all workflow runs but the latest #5206
  • [enhancement][ci] Fix master not running Simple CI #5368
  • [enhancement][ci] Refine Simple CI and Clang analysis #5367
  • [enhancement][feature][bug][ci][documentation][interface] Fix upsample bilinear bug #5363
  • [enhancement][ci] Build nightly for py39 #5318
  • [enhancement][ci] Try distributed run for 3 times to prevent failure #5305
  • [enhancement][ci] Upload Simple CI logs to cloud #5268
  • [enhancement][ci] Remove cpu_op_eager and cuda_op_eager #5470
  • [bug][ci] fix segfault in clang plugin #5437
  • [enhancement][ci] Refine Simple CI error output #5435
  • [enhancement][ci] Add conda env to Simple CI #5385
  • [enhancement][ci] Fix clang plugin core file not found #5390
  • [bug][ci] upload core when build with clang plugin #5384
  • [bug][ci] clang plugin skip more files #5373
  • [enhancement][ci] Use gh-action-scheduler-v2 #5370
  • [enhancement][ci] relax speed threshold #5569
  • [bug][ci] Fix wrong test path under compatible #5567
  • [enhancement][ci][need-simple-ci] Prevent upload logs automatically #5560
  • [enhancement][ci][interface] Add nn.AdaptiveAvgPool1d and nn.AdaptiveAvgPool3d #5445
  • [feature][ci] add speed test in ci #5496
  • [enhancement][ci] Reduce usage of Simple CI #5546
  • [feature][bug][ci][api] Restruct upsample module #5524
  • [feature][ci] multi client launcher test #5488
  • [enhancement][ci] Remove automerge if cuda_new_interface failed #5519
  • [enhancement][ci] Prevent adding subdir in python/test #5514
  • [enhancement][ci] piprepo->pipindex #5517
  • [enhancement][ci] add dynamic_loss_scale in ci tests #5337
  • [enhancement][ci] Add timeout for wait_gpu_slot #5497
  • [enhancement][feature][ci] new static check based on clang-tidy #5476
  • [enhancement][ci] Fix url not downloadable in some browsers #5701
  • [feature][ci] multi client multi machine test #5685
  • [enhancement][ci] Add cpu new interface CI #5639
  • [enhancement][ci][need-simple-ci] Mv clangtidy to simple ci #5667
  • [enhancement][ci][need-simple-ci] use clang tidy appimage in ci #5841
  • [enhancement][ci] Use gcc 7 in release to prevent error #5840
  • [enhancement][ci] bn tol 1e-4 => 1e-3 #5811
  • [enhancement][ci] fix distributed run on built dir #5810
  • [enhancement][ci] fix third party mirror check_sum #5802
  • [ci][documentation] find more accurately which files need to be doctested #5782
  • [enhancement][ci] Print stack unconditionally #5779
  • [enhancement][ci][need-simple-ci] Enable more checkers for clang-tidy in CI #5738
  • [enhancement][ci] CI: add clang-tidy check to test.yaml #5920
  • [ci][documentation] fix docstring in oneflow.nn.functional namespace #5807
  • [enhancement][ci] disable TREAT_WARNINGS_AS_ERRORS in Release CI #5886
  • [enhancement][ci] Skip ci jobs by git diff #5863
  • [bug][ci] quick fix #5978 #6030
  • [enhancement][bug][ci] fix clang tidy diff options and file format #5990
  • [enhancement][ci] add flow.relu #5847
  • [enhancement][ci] equal => allclose #6164
  • [bug][ci][need-simple-ci] CI: fix clang tidy checks in simple ci #6161
  • [enhancement][bug][ci][documentation][api] add interpolate and layer_norm docs #6157
  • [bug][ci] update speed test #6113
  • [enhancement][bug][ci][documentation][api] speed import oneflow #6107
  • [bug][ci] Also try install dev deps for CODEGEN_PYTHON_EXECUTABLE #6115
  • [bug][ci][need-simple-ci] set gtest_CMAKE_DEBUG_POSTFIX "d" #6085
  • [enhancement][ci] add cache init file for clang and CI build with clang #6062
  • [enhancement][ci] add emoji in speed test output, make it continue-on-error #6214

Test enhancements:

  • [bug][test][interface] Fix acos ci bug #5217
  • [feature][test] implement automated test #5321
  • [enhancement][test] move generator test into ops folder to accelerate tests #5472
  • [feature][test][api] Add autotest part2 #5467
  • [enhancement][test][api][interface] Add some tests with the new framework for auto testing #5561
  • [bug][test] fix test error when do multi case test on graph #5590
  • [enhancement][test] Refine module test using auto test by yaochi #5484
  • [enhancement][test] Add autotest for BatchNorm2d #5734
  • [enhancement][test] RTH_update_op_test #5823
  • [enhancement][test] dev adamw graph config #5745
  • [feature][test][api][interface] Add new autotest #5562
  • [bug][test] restore test of alexnet graph #5798
  • [enhancement][test][interface] add zhangshen op-test #5600
  • [feature][bug][tooling][test][interface] Record autotest wrong code #5923
  • [enhancement][feature][test][api] add randint #5718
  • [bug][test] fix multi machine test #5984
  • [enhancement][test][interface] some op test #6095

Tooling enhancements:

  • [bug][tooling] user/summary: fix memory leak in FillImageInSummary #5742
  • [enhancement][tooling][cfg] cfg: add move assignment operator for performance #5962
  • [enhancement][tooling][api][refactor] refactor_all_device_placement_api #6080
oneflow - v0.5rc2

Published by jackalcooper about 3 years ago

Changelog

v0.5rc2 (28/09/2021)

Highlights

  • First-class support for eager execution. The deprecated APIs are moved to oneflow.compatible.single_client
  • Drop-in replacement of import torch for existing PyTorch projects. You could test it by interchanging the imports: import oneflow as torch, or import torch as flow (see the sketch just after this list).
  • nn.Module for eager execution
  • nn.Graph for lazy execution
  • DDP for data parallel (a hedged usage sketch follows the nn.Graph example below)
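
A minimal sketch of the drop-in replacement mentioned above (hypothetical usage, not a snippet from the release itself; it relies only on flow.randn and flow.relu, both added in this changelog as #5736 and #5847):

import oneflow as torch  # OneFlow standing in for PyTorch

x = torch.randn(2, 3)    # same tensor factory name as in PyTorch (#5736)
y = torch.relu(x)        # resolves to flow.relu, added in #5847
print(y.shape)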

A sneak peek of the new API

Here is a minimal example showcasing how to incorporate an nn.Module in an nn.Graph and run it in lazy mode.

import oneflow as flow

class NeuralGraph(flow.nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model  # model is an nn.Module instance

    def build(self, x):
        # build() describes the computation, much like nn.Module.forward;
        # OneFlow compiles it into a static graph on the first call
        y_pred = self.model(x)
        return y_pred

model = flow.nn.Linear(3, 4)  # any nn.Module instance works here
graph = NeuralGraph(model)    # create an nn.Graph instance
x = flow.randn(2, 3)
y_pred = graph(x)             # run the created nn.Graph
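
The Highlights also list DDP for data parallel. The following is a hedged sketch rather than the documented API of this release: it assumes the oneflow.nn.parallel.DistributedDataParallel wrapper (see the ddp entries #5008 and #5913 in this changelog) and a launcher that starts one process per device.

import oneflow as flow
from oneflow.nn.parallel import DistributedDataParallel as ddp

# run with one process per GPU; parameters and buffers are broadcast
# from rank 0, and gradients are synchronized across ranks in backward
model = flow.nn.Linear(3, 4).to("cuda")
model = ddp(model)  # wrap the module; forward/backward usage is unchanged
y = model(flow.randn(2, 3).to("cuda"))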

New in Python API

  • [feature][eager][op][test][python][interface] Add test for convtranspose2d #5239
  • [enhancement][python][interface] Add GroupNorm #5175
  • [enhancement][eager][python][interface] [Add] avgpool1d avgpool3d #5165
  • [feature][eager][op][python][interface] Add deconv cpu impl #5224
  • [bug][eager][api][python][interface] Fix acosh bug #5221
  • [feature][eager][op][python][interface] Dev modules ctc loss #5168
  • [bottleneck][bug][documentation][python][interface] Fix meshgrid test bug #5208
  • [eager][documentation][python][interface] Rename CosineScheduler to CosineAnnealingLR #5112
  • [feature][eager][python][interface] Add meshgrid module #5205
  • [enhancement][feature][bug][op][python] support bias in conv2d's parameter list #5322
  • [eager][documentation][api][python][interface] add not_equal, greater_equal and less_equal module #5350
  • [enhancement][eager][python] refine pow module and its test #5319
  • [enhancement][eager][op][python] Add triu op #5329
  • [enhancement][bug][python] Fix optimizer for not supporting all kinds of iterables #5355
  • [bug][python][interface] raise IndexError in get_canonical_index to support for loop #5345
  • [bug][python][interface] tensor slice assign supports broadcasting #5344
  • [enhancement][op][python] add cpu group conv logic #5314
  • [enhancement][python] Add 'nn.Mish' module and corresponding functions #5310
  • [enhancement][build][python] Remove ONNX from setup py #5297
  • [enhancement][python][interface] [add] zeropad2d #5278
  • [feature][system][python][interface] Lazy nn.Graph FeedInputOpExpr #5458
  • [feature][python][interface] integrate nn.image.flip #5411
  • [bug][python] Fix issues in point of MultiClientSession #5469
  • [enhancement][bug][python] update HasAllMultiClientEnvVars() #5459
  • [enhancement][python] Add in_top_k function #5428
  • [enhancement][python] Dev add docstring #5449
  • [feature][api][python] MultiClientSession #5407
  • [documentation][python] remove --user #5431
  • [feature][python][interface] nn.Graph python #5309
  • [feature][python][interface] Fea/nn graph/graph name #5413
  • [bug][python][interface] rm nn.Graph.train #5424
  • [op][documentation][api][python][interface] add bernoulli module #5353
  • [enhancement][python] flow.S/B/P #5306
  • [enhancement][documentation][python] Add instruction on upgrade pip #5400
  • [enhancement][python] Rm oneflow export and experimental #5589
  • [bug][python] Fix nn.graph.utils module conflict #5598
  • [feature][ci][python] Update autotest framework #5520
  • [enhancement][python] copy of_proto_python_dir to compatible_single_client_python #5539
  • [enhancement][api][python] del default env init #5537
  • [enhancement][python] Fix single client using same glog file #5535
  • [bug][api][python] Fix Session TryClose #5531
  • [enhancement][feature][python] split vector-matrix norm #5478
  • [feature][eager][op][python][interface] Add more upsample kernel #5382
  • [enhancement][feature][test][python] add torchstyle unittest #5489
  • [feature][system][python] nn.Graph with training #5662
  • [enhancement][feature][python] Fea/nn graph/block proxy func #5727
  • [enhancement][api][python] consistent_tensor_to_api #5703
  • [feature][eager][op][python] Dev Align torch avgpool #5610
  • [enhancement][python] fix circular deps of sbp python module #5706
  • [documentation][python] [part5]Remove singleclient outdated api #5674
  • [enhancement][python] [part4]Remove singleclient outdated api #5672
  • [bug][op][python] remove outdated code in conv3d #5696
  • [enhancement][test][python] enlarge tolerance of dataloader test #5689
  • [enhancement][test][python] add autotest for some math ops #5646
  • [feature][python] nn.Graph optimizer part 2: add L2, pass job complete, refactor #5604
  • [enhancement][python] Add clip_grad_norm #5299
  • [purge][python] Remove Single-Client API in oneflow default python #5827
  • [bug][python] Fix ddp grad size #5834
  • [enhancement][feature][python] Dev RMSprop graph conf #5768
  • [enhancement][purge][eager][python] remove scale arg in optimizer #5821
  • [enhancement][feature][python] graph/block io check #5803
  • [enhancement][feature][python] Dev adam graph conf #5709
  • [purge][python] [part10]Remove singleclient outdated api #5756
  • [feature][api][python] better repr of nn.Graph for debug #5762
  • [bug][python] fix weight decay in RMSprop #5755
  • [purge][python] [part9]Remove singleclient outdated api #5752
  • [purge][python] [part8]Remove singleclient outdated api #5750
  • [documentation][python] add first batch of methods in oneflow.nn.functional namespace #5693
  • [purge][python] [part6]Remove singleclient outdated api #5704
  • [bug][python] use default_generator.seed() as random_seed in init #5721
  • [bug][system][python] ddp broadcast params and buffers #5913
  • [enhancement][test][python] Add consistent tensor requires grad test #5925
  • [bug][python] wrap flow.nn.init.* with flow.no_grad() #5932
  • [feature][api][python][interface] add clip_grad to optimizer #5817
  • [enhancement][ci][op][test][python] add randperm with test and docs #5680
  • [feature][api][python] Fea/nn graph/ lr_schedule(and cosine lr_sch) and opt_group #5846
  • [bug][python] fix bug of SyncOnMasterFn atexit #5909
  • [purge][python] Delete single client nn modules #6061
  • [enhancement][python] Move framework.distribute to env #6022
  • [bug][python] skip sync when abnormally exiting #6025
  • [feature][python] Fea/nn graph/warmup amp config #5969
  • [documentation][python] add optimizer api docs #6131
  • [documentation][python] add_tensor_api_doc #6127
  • [bug][python] Fix test_grid_sample.py and test_affine_grid.py threshold #6125
  • [documentation][api][python] add doc of graph #6093
  • [bug][python] Fix make of_format fail in ubuntu #6120
  • [feature][api][python][interface] Fea/graph helpers #6088
  • [enhancement][eager][python][interface] Use flow.randint in dataloader #6086
  • [feature][eager][api][python][interface] Import oneflow as torch #6076
  • [enhancement][test][api][python][refactor] rename OfrecordReader to OFRcordReader #6090
  • [purge][python][need-single-client-tests] Delete single client nn modules #6082
  • [enhancement][python] flow.load tolerates FileNotFound fault #6083
  • [feature][python] Fea/pipeline in graph #6105
  • [enhancement][test][python] graph activation checkpointing #6192
  • [enhancement][feature][op][python] rnn test #6165

New in Ops:

  • [enhancement][op][api][refactor] [Functional] Part2: Add partial unary and math functional apis #5218
  • [enhancement][bug][op][interface] Refine deconv kernel #5229
  • [enhancement][op][api][interface] add ReflectionPad2d #5172
  • [feature][eager][op][api][interface] crossentropyloss and nllloss support ignore_index #5195
  • [feature][eager][op][api][interface] Yejiaojiao/dev bcewithlogitsloss #5173
  • [bug][ci][op] Dev user op set default is_dynamic #5223
  • [enhancement][op] add magic method for pow #5199
  • [enhancement][op][interface] add cpu version of upsampling #5194
  • [enhancement][bug][op][api][interface] add ReplicationPad2d #5148
  • [feature][eager][op][api][interface] add kldivloss module #5155
  • [feature][eager][op][documentation][build][api][interface] Add floor module and the corresponding testcases #4964
  • [enhancement][feature][op] Dev conv1d module #5280
  • [enhancement][op] Add ctc_greedy_decoder op #5294
  • [enhancement][op][system] Dev remove default grad func #5320
  • [enhancement][op][system] Add pad grad func. #5354
  • [enhancement][op][system] Add gradient funcs. #5348
  • [feature][purge][bug][eager][op][interface] fix upsample nearest bug #5347
  • [enhancement][op][system] [Functional] Part7: Migrate pooling ops #5253
  • [enhancement][op] nvjpeg hardware acc #5240
  • [enhancement][feature][ci][eager][op][api][interface] Add bmm module #5334
  • [enhancement][eager][op] Dev image decode eager #5333
  • [enhancement][op] Optimize softmax warp impl #4977
  • [enhancement][eager][op] Dev tensor buffer eager #5317
  • [enhancement][op][api][refactor] [Functional] Part6: Migrate conv op #5252
  • [enhancement][eager][op] Dev sort eager #5284
  • [enhancement][bug][op][api] fix bceloss bug in default weight and reduction #5303
  • [bug][eager][op] remove redundant assert and check #5264
  • [enhancement][bug][ci][op] fix bceloss bug about weight #5269
  • [enhancement][op][api][refactor] [Functional] Part5: Migrate nn ops #5249
  • [enhancement][eager][op] Dev argsort eager #5273
  • [enhancement][op][api][refactor] [Functional] Part4: Migrate array ops #5247
  • [enhancement][op][api][refactor] [Functional] Part3: Migrate binary and activation ops #5246
  • [bug][ci][op][test] Dev fix rmsprop ci fail #5481
  • [enhancement][op] add inplace method: Tensor.sin_ #5471
  • [bug][op] hotfix image_batch_align #5461
  • [enhancement][eager][op][interface] Dev maxpool series op 123d #5244
  • [bug][op] fix pool gpu kernel #5446
  • [feature][eager][op][api][interface] add pixelshufflev2 module #5383
  • [enhancement][feature][ci][eager][op][documentation][api][interface] Add flow xxx and tensor xxx autotest #5386
  • [enhancement][feature][eager][op][api][interface] Modules chunk #5324
  • [enhancement][eager][op] add image normalize for eager #5402
  • [enhancement][eager][op] Dev batch align module #5401
  • [enhancement][eager][op] add coco reader module #5391
  • [enhancement][wip][op] Restruct Elementwise kernel #4130
  • [bug][op] Fix DecodeRandom reuse mem #5606
  • [enhancement][op] Align pytorch maxpool #5525
  • [enhancement][bottleneck][eager][op][api] implementation of constantpad-3d op #5529
  • [enhancement][eager][op] Add scale size for resize #5509
  • [enhancement][op][api][refactor] Dev optimize tensor setitem #5501
  • [enhancement][op] register uint8 dtype to support dataloader #5499
  • [enhancement][op] Add unique.cuh #5487
  • [enhancement][op][api][interface] Dev ofrecord auto truncating #5412
  • [feature][op][system][interface] Feat: LazyInterpret::ApplyImpl support SourceUserOpExpr and Copy #5711
  • [enhancement][op][interface] Dev logical_and/or modules #5636
  • [enhancement][op] support any number of positional arguments for ones and zeros op #5698
  • [enhancement][feature][eager][op] Add conv3d Module #5327
  • [feature][eager][op][api][interface] add batchnorm3d module #5631
  • [bug][eager][op] fix reduce min max backward bug #5651
  • [enhancement][op] Debug dim scatter #5371
  • [enhancement][op][interface] Dev eye #5583
  • [enhancement][eager][op] Dev minimum maximum #5576
  • [enhancement][op] Restruct activation grad op #5669
  • [enhancement][feature][eager][op] Rewrite activation function #5465
  • [bug][op][documentation] add oneflow.cat for documentation #5621
  • [enhancement][op] Lcy logsoftmax #5746
  • [feature][op][need-simple-ci] Feat empty op #5659
  • [enhancement][eager][op] Dev split #5714
  • [enhancement][op][interface] add index_select op #5661
  • [bug][op] fix nvjpeg hw acc #5851
  • [enhancement][op] Remove move in conv_cudnn #5828
  • [enhancement][op][interface] Dev logical_xor module #5694
  • [bug][eager][op] fix squeeze #5808
  • [enhancement][op] Get parallel_id and parallel_num through rank and world size in DDP #5717
  • [bug][eager][op] delete interpolate int type #5805
  • [bug][op] Fix bug in scatter #5743
  • [enhancement][op] Refactor: remove module not required, call function directly #5754
  • [enhancement][op] Remove modules not required(tan, erfc, log1p, scatter_nd) #5791
  • [enhancement][op] Refactor scatter, clamp and pow in cpp instead of in python #5715
  • [enhancement][op] Rm useless code in gather files #5687
  • [enhancement][eager][op] change flip_code to scalar #5786
  • [enhancement][bug][op][interface] fix upsample bug #5753
  • [bug][op][interface] Quick fix Lazy nn.Graph input/output OpConf.BlobConf.is_dynamic #5767
  • [enhancement][bug][eager][op] fix argwhere 0-dim bug #5760
  • [enhancement][eager][op] delete unused code #5744
  • [feature][op] Export fused_scale_tril op #5933
  • [bug][op] Fix backward bug in 3d #5908
  • [bug][op] Fix one_hot api limit #5927
  • [enhancement][eager][op] Dev where scalar #5797
  • [bug][op] fix grad error #5914
  • [feature][bug][op] Fix inplace op circle reference bug #5910
  • [enhancement][op] Move the judgment content to C++, and add scalar fmod #5854
  • [enhancement][op] Support combined_margin_loss op in flow.nn.modules #5830
  • [enhancement][op][api][interface] functional_one_hot #5315
  • [enhancement][op] Dev scalar op #5778
  • [bug][eager][op] fix gather kernel 0 shape #5888
  • [enhancement][op] add l2_normalize for multi-client interfaces #5859
  • [feature][op] Export function softmax_cross_entropy #6056
  • [enhancement][op] Add int attr for functional adaptive average pool #6059
  • [enhancement][op][interface] dev full op #5955
  • [bug][eager][op] fix 0dim inplace add #6029
  • [feature][op][system][interface] Feat: nn.Graph image gpu decoder #6014
  • [enhancement][op][interface] dev optim_optim_lr_scheduler_multisteplr #5975
  • [enhancement][op] NopKernel #6035
  • [enhancement][eager][op][api] Dev tril op #6005
  • [enhancement][op] dev unfold and fold #5675
  • [enhancement][op] ResNet CUDA Graphs #6018
  • [enhancement][feature][op] add broadcast pow #6013
  • [enhancement][op][interface] init of op diag #5298
  • [op][documentation][api] Fix api document bug #6009
  • [enhancement][op] Dev fused functional #5954
  • [bug][op][build] Add nvcc flag -Werror cross-execution-space-call #6002
  • [bug][op] Fix Normalization grad function #5993
  • [enhancement][feature][eager][op][test][interface] Add fused self attention #5966
  • [enhancement][bug][ci][eager][op][api][interface] Try to fix var bug #5973
  • [enhancement][feature][eager][op][interface] add prod op #5867
  • [enhancement][eager][op][api] add glu op #6065
  • [enhancement][op] Align Torch.nn.functional poolXd #6184
  • [bug][eager][op] fix backward index for gamma beta #6149
  • [bug][op][system] Fix BroadcastMatmulGrad bug #6168
  • [enhancement][op][api] Add Int support for functional.avg/maxpool #6174
  • [bug][eager][op][api][interface] align dropout api name with pytorch #6170
  • [enhancement][op] support inplace operation for hardsigmoid #6137
  • [enhancement][bug][op] Fix do bias correction in Adam/AdamW #5960
  • [bug][eager][op][api][interface] fix repeat 0-dim tensor bug #6150
  • [enhancement][bug][op] Fix select_first_grad bug #6142
  • [bug][ci][eager][op][documentation][interface] Add clipgrad doc and contiguous #6130
  • [bug][op] Fix eager optim dynamic attr bug #6111
  • [enhancement][op] Support grid_sample and affine_grid operator #6038
  • [op][documentation] Export apis for documentation #6068
  • [enhancement][feature][bug][ci][eager][op][documentation][interface] transfer python function to c++ method #6114
  • [op][documentation] Dev functional batch_gather #6233
  • [enhancement][op][test] fix cross_entropy_loss and its test #5799
  • [bug][op] Use attr nd_sbp to check consistent #6222
  • [enhancement][op] Dev fused bn functional #6077
  • [enhancement][op] support default value in intlist #6201
  • [bug][op] fix sparse_softmax get_nd_sbp #6203
  • [bug][op] Fix bug in model fused update #6197
  • [enhancement][op][system][refactor] Optimize tensor getitem. #5433

New in Eager:

  • [enhancement][eager][interface] Reconstruct module files #5251
  • [bug][eager][documentation][interface] Fix conv module bug #5245
  • [bug][ci][eager][interface] Fix bce withlogitloss ci error #5237
  • [feature][eager][api][interface] module BCELoss #5144
  • [enhancement][feature][eager][api][interface] Dev norm op #5178
  • [enhancement][bug][eager] Fix stack module #5222
  • [enhancement][feature][eager] Support different dtype of equal module #5214
  • [enhancement][bug][eager][documentation][api][interface] Add nllloss backward #5210
  • [enhancement][eager][api][upload-core] Decouple FileSystem and IOConf #5162
  • [enhancement][ci][eager] Set lower precision avoid ci failing #5200
  • [eager][documentation] Add hint when apply FunctionNode second time #5369
  • [enhancement][feature][bug][ci][eager][documentation][api] Fix upsample bilinear bug #5366
  • [bug][eager] Fix not contiguous ndarray to tensor bug #5351
  • [enhancement][eager][system] Infer consistent tensor meta #5118
  • [feature][eager] Feat graph autograd engine #5296
  • [enhancement][eager][interface] Dev type as module #5349
  • [feature][eager][documentation][api][interface] Add new ones module #5342
  • [enhancement][bug][eager] Fix logical slice assign dtype #5339
  • [bug][ci][eager][documentation][api][interface] Fix where module bug #5300
  • [bug][ci][eager][documentation][api] Fix l1loss ci error #5307
  • [enhancement][bug][eager][documentation][api][interface] Qi's First Edit of deleting "print" and ".numpy" #5129
  • [feature][eager][refactor] Separate autograd meta to tensor #5267
  • [feature][eager][api][interface] add tile module #5234
  • [enhancement][eager] Release lambda function to reuse tensor memory #5266
  • [feature][bug][eager][documentation] Fix default value not set bug #5483
  • [enhancement][eager][interface] [Add] gather_nd scatter_nd #5422
  • [enhancement][bug][eager] fix param #5473
  • [bug][eager] Fix Tensor.grad setter bug #5462
  • [enhancement][eager] Rename now_grad_arg to current_grad #5466
  • [eager][test][documentation][interface] Add autotest part1 #5436
  • [enhancement][eager] Use functional copy instead of op_builder #5460
  • [bottleneck][bug][eager][interface] fix -1 index not support bug #5448
  • [bug][ci][eager][documentation][api] Fix concat backward bug #5443
  • [enhancement][bug][ci][eager] Add autograd engine warning #5444
  • [feature][eager][api][interface] Smoothl1loss #5256
  • [enhancement][bottleneck][eager] remove device dtype params #5434
  • [bug][ci][eager][documentation][interface] Delete maxpool failed test #5409
  • [enhancement][eager][api] Add tensor grad assignment #5379
  • [enhancement][bug][eager] fix-abs #5398
  • [enhancement][bug][eager][interface] Fix bn track running stats #5393
  • [enhancement][bug][eager] Support uint dtype of constant op #5396
  • [enhancement][bug][eager][documentation][interface] Delete useless code upsample #5392
  • [enhancement][ci][eager][interface] add flow.view #5301
  • [enhancement][bug][ci][eager][api][interface] Add masked select module #5356
  • [bug][eager][interface] Fix batchnorm backward bug #5602
  • [enhancement][eager] Support weight_decay (l2 actually) #5587
  • [feature][eager][documentation][api] Add new autotest #5588
  • [enhancement][eager][documentation][api] Dev fmod #5404
  • [feature][eager] Support inplace add #5432
  • [feature][eager][interface] Feat tensor stride property #5543
  • [enhancement][feature][eager][documentation][api] Add flip module #5541
  • [feature][eager] Feat module repr #5486
  • [enhancement][bottleneck][bug][eager][interface] Fix maxpool1d params #5493
  • [enhancement][feature][eager][interface] Dev flow.utils.data part1 #5406
  • [bug][eager][api] Fix tensor getitem bug #5474
  • [enhancement][eager][need-simple-ci] export datasets interface #5691
  • [enhancement][eager][system] rebase #5601
  • [enhancement][eager][test] added nn.RecordBytesDecoder with its test #5475
  • [enhancement][feature][eager][need-simple-ci] 0-dim tensor support #5552
  • [enhancement][bug][eager] rewrite slice_update backward #5677
  • [enhancement][bug][eager][interface] align view input style with torch #5676
  • [enhancement][eager][interface][need-simple-ci] add autotests for modules #5666
  • [enhancement][bottleneck][eager][interface] Dev constantpad1d op #5579
  • [enhancement][eager][api][interface] Restruct MathOps AutoTest #5654
  • [enhancement][bug][ci][eager] Fix flip bug #5657
  • [bug][eager][api][interface] Fix expand module bug #5650
  • [enhancement][bug][eager][documentation][api] Fix repeat bug #5633
  • [enhancement][eager][test][api][interface] Add new autotest #5617
  • [enhancement][eager][api][interface] Dev flow.utils.data part2 #5500
  • [enhancement][bug][eager] make setitem device match #5835
  • [bug][eager][api][interface] align reshape input param with pytorch #5804
  • [feature][bug][eager][api] Align where op with torch #5850
  • [enhancement][bug][eager][api] Restruct prelu op #5829
  • [bug][eager][need-simple-ci] fix pooling ceil_mode bug #5818
  • [enhancement][eager] stateful local kernel supports consistent #5789
  • [bug][eager][api][interface] Fix argwhere bug #5816
  • [enhancement][eager][documentation][api] dev-nonzero #5809
  • [enhancement][feature][eager][api] Add fake quantize op #5690
  • [enhancement][bug][eager][documentation][api] Add api #5663
  • [enhancement][eager] Refactor consistent infer result #5790
  • [bug][eager][need-simple-ci] skip dataloader test #5780
  • [bug][eager][need-simple-ci] fix 0-dim tensor.fill_ #5771
  • [enhancement][eager] Cpu mpi broadcast #5726
  • [feature][eager] Feat grad mode classes #5956
  • [enhancement][bug][eager] fix wrong names #5951
  • [enhancement][eager][system] Local dep object pool #5953
  • [enhancement][eager][interface] rename OpExprInterpState to AutoGradCaptureState #5918
  • [bug][eager] Fix linear bug #5945
  • [bug][eager] Fix tensor_meta update bug #5924
  • [enhancement][eager] use flow.randperm #5928
  • [enhancement][eager] consistent init/save/load #5896
  • [enhancement][bug][eager][documentation][interface] Restruct sort and argsort op #5911
  • [enhancement][bug][eager][interface] Try to fix the problem that the insightface cannot converge. #5906
  • [enhancement][bug][eager][interface] Add autotest #5899
  • [enhancement][eager] The scheduler thread joins worker threads #5893
  • [enhancement][eager] Bugfix async callback #5881
  • [feature][eager] Feat tensor to bool #5836
  • [bug][eager] Remove inplace broadcast_add #5551
  • [enhancement][eager] Broadcast consistent shape and dtype #5784
  • [enhancement][eager] Fix optimizer list parameters input bug #5848
  • [enhancement][eager][interface] Dev flow.utils.data part3 #5644
  • [enhancement][eager][api] Normalize naming of modules #6066
  • [enhancement][feature][eager][api][interface] add truncnormal #6051
  • [enhancement][bug][eager] AutoMatedTest support test module.parameter.grad #6043
  • [enhancement][feature][bug][eager] add module call kwargs #6069
  • [enhancement][eager][api][interface] add tensor.item tensor.tolist #6021
  • [enhancement][eager][api][interface] Export pool ops api #6047
  • [enhancement][bug][eager][test][documentation][interface] Add more autotest sample #6039
  • [enhancement][bug][eager][system] disable cuda_h2d stream #6020
  • [feature][eager][test][api][interface] Add autotest codegen #6019
  • [feature][eager][documentation] Refactor cosine lr scheduler #6000
  • [enhancement][eager][interface] tensor.cpu/tensor.cuda #5894
  • [enhancement][eager][api] Support consistent_tensor.to(dtype) #5991
  • [bug][eager][interface] remove redundant code in ModuleDict #5961
  • [bug][eager] Fix LayerNorm check bug #6196
  • [enhancement][eager][api] Change dropout api #6182
  • [enhancement][good for pr][eager][api][interface] add: test convert dependency #6023
  • [enhancement][bug][eager][interface] Fix autotest codegen bug #6171
  • [bug][eager] restore instr_local_dep_object_pool_size for nccl #6160
  • [enhancement][eager][api][interface] Align pooling op functional api names with torch #6163
  • [feature][bug][eager][api][interface] delete file #6162
  • [bug][eager] Fix optim load_state_dict bug #6152
  • [enhancement][eager][api] add is_training to dropout functor #6148
  • [enhancement][eager] Decompose nd sbp boxing #5800
  • [enhancement][eager] support consistent_tensor.to(copy=True) #6122
  • [feature][eager] Static grad scaler #6135
  • [bug][eager] Fix LayerNorm expr bug #6121
  • [bug][eager][api] move numpy c api init in numpy.cpp, make np array contiguous before copying #6117
  • [enhancement][eager][refactor] Remove params from ParamGroup getitem #6096
  • [enhancement][feature][eager] Support tensor and optimizer serialization #6087
  • [enhancement][bug][eager] fix bug about tensor str in nonsymmetric cast and getitem in consist… #6239
  • [enhancement][eager] Cpu all reduce #5849
  • [feature][eager] Support assign copy interface #6228
  • [enhancement][eager][api][interface] Dev reconstruct pad ops #6223
  • [enhancement][eager][api][interface] support flow.cuda.is_available #6124
  • [bug][eager] make flow._C.local_all_reduce sync launched #6175
  • [enhancement][eager] Rename flow to oneflow in user hint #6190
  • [bug][eager][tooling][test][api][interface] Autotest generate input tensor #6206
  • [enhancement][eager] consistent tensor zeros_() #6202
  • [enhancement][eager] Cpu mpi #5865

Build enhancements:

  • [bug][build] Fix GRPC compilation failure on CMake 3.20 #5255
  • [bug][build] Refine header file copy #5254
  • [bug][build] Fix older version CMake doesn't support multiple targets in CLI #5248
  • [bug][build] Turn off NCCL_STATIC/CUDNN_STATIC when CUDA_STATIC is OFF #5243
  • [feature][build] Fix support for Ninja and add Ninja build in Simple CI #5236
  • [enhancement][build] Add cmake option CUDA_STATIC #5164
  • [bug][build] Fix protobuf debug postfix #5233
  • [enhancement][ci][build] Move default third party dir into build dir #5230
  • [enhancement][build] Refine protobuf cmake #5216
  • [enhancement][ci][build] Remove transport test main #5215
  • [enhancement][ci][build] Speedup opencv build #5213
  • [enhancement][build] Support clang #5015
  • [enhancement][documentation][build] Add prefix when creating git archive #5201
  • [enhancement][build] Add cmake option NCCL_STATIC #5160
  • [enhancement][build] Refine CMake CUDA version handling #5192
  • [enhancement][build] Use clang plugin to check Maybe variables are used #5358
  • [enhancement][build] Add BUILD_BYPRODUCTS for ExternalProject_Add #5316
  • [enhancement][build] Add cmake init cache to simplify user onboarding #5311
  • [feature][bug][build] Fix macOS support and run macOS build in Simple CI #4947
  • [enhancement][build] flatbuffers use mirror #5295
  • [enhancement][build] Don't build test by default #5302
  • [enhancement][build] Prevent building from scratch when toggling flag BUILD_GIT_VERSION #5259
  • [enhancement][build] Refine gRPC, glog, gflags cmake for conda #5276
  • [feature][build] Support XLA with CPU-only #5260
  • [enhancement][ci][onnx][build] Remove ONNX from CI #5257
  • [enhancement][build] Refactor build_wheel to support oneflowinc images #5427
  • [enhancement][build] Add arg skip_audit in build wheel #5423
  • [bug][build] hwloc disable shared #5388
  • [documentation][build] Update readme for autoconf and libtool #5376
  • [enhancement][build] remove dir python and compatible_single_client_python #5609
  • [bug][build][system] Fix pyyaml version #5594
  • [enhancement][ci][build] force release flags #5574
  • [bug][build] prevent endless loop #5534
  • [enhancement][build] Support sccache #5528
  • [enhancement][build] Add definition for CMAKE_BUILD_TYPE and print cmake_build_type in oneflow doctor #5505
  • [enhancement][ci][build][need-simple-ci] Fix macOS for recent changes #5705
  • [bug][build] fix return type error on gcc 4.8.5 #5660
  • [enhancement][build] Check CMAKE_BUILD_TYPE #5656
  • [enhancement][build] add -Werror=return-type #5655
  • [enhancement][build] Clean and fix for new py dir #5618
  • [enhancement][build] cmake: disable array-bounds check & treat warnings as errors for pyextobj and oneflow_internal & fix warnings #5838
  • [bug][build] set CMAKE_BUILD_TYPE to Release if undefined #5842
  • [enhancement][build][need-simple-ci] Fix all warnings & Add option TREAT_WARING_AS_ERROR to cmake #5751
  • [enhancement][build] add CMAKE_INTERPROCEDURAL_OPTIMIZATION in fast cmake cache #5970
  • [enhancement][build] add clang tidy target #5957
  • [bug][build] cmake: fix cmake cache args in opencv #5959
  • [enhancement][build] Add cmake option USE_SYSTEM_NCCL #5897
  • [enhancement][build] cmake: include third party headers as system headers to avoid warnings #5879
  • [enhancement][build] Ignore opencv-python on machine aarch64 #5884
  • [enhancement][build] enable CMake first class cuda support #5858
  • [bug][build] Fix compile warning (strict-aliasing) #5872
  • [enhancement][bug][build][need-simple-ci] Upgrade gtest and fix some errors raised by clang #6079
  • [bug][ci][build] cmake: fix ninja build in CI #6072
  • [bug][build] fix files not actually removed when building for multiple python versions #6060
  • [bug][build][api] functional_api: fix build error in macOS #6010
  • [bug][build][need-simple-ci][need-single-client-tests] Fix recompile from scratch #6036
  • [bug][build] Turn on NVCC's warnings #6011
  • [bug][build][need-single-client-tests] fix bundle .so of other python version #6034
  • [bug][ci][build][need-single-client-tests] use copy_all_files_in_dir to replace copy_files #6033
  • [enhancement][build] check compiler version in cmake #6026
  • [enhancement][build] Add CUDA_NVCC_THREADS_NUMBER #6017
  • [enhancement][build][need-simple-ci] optimize of_include_copy #5978
  • [enhancement][ci][build][need-single-client-tests] CI: remove -DTREAT_WARNINGS_AS_ERRORS=OFF #6008
  • [enhancement][build][xla] xrt: fix all warnings #5915
  • [enhancement][build] Prevent opencv compile failure with std 17 #5997
  • [enhancement][build] Use bundled cub #5998
  • [enhancement][ci][build] update clang tidy diff warnings-as-errors option #5989
  • [enhancement][build] Update run_clang_tidy.py to set return code and add warning-as-errors #5977
  • [enhancement][build] check: fix clang-tidy-diff commands #5972
  • [bug][build] Suppress NVCC warning #177-D #6094

XLA enhancements:

  • [bug][xla] Make the blob header memory aligned. #5286

System:

  • [enhancement][system] Refactor Memory Zone #5072
  • [enhancement][system] Add interface InferContext::OutputTensorDesc #5219
  • [bug][system] Lazy construct functor to make sure that the operators have already been registered. #5225
  • [enhancement][system] Refactor infer ctx output isdynamic #5220
  • [enhancement][system] Refactor infer ctx input isdynamic #5211
  • [enhancement][system] Wake up the heartbeat thread immediately #5081
  • [enhancement][system] Fix xla test case fail #5203
  • [enhancement][system] Add interface InferContext::InputDType #5153
  • [purge][system] delete const_cast in Output #5196
  • [feature][system] Add hwloc for topology detection #5291
  • [enhancement][system] fix registry may segment #5336
  • [enhancement][system] Use functional api instead of op_expr_helper::XXXOp. #5364
  • [enhancement][system] move btob to op #5274
  • [documentation][system] Add Latest News section in README #5361
  • [enhancement][bug][system] fix dropout module: return directly if not training #5346
  • [bug][system] add missing JUST #5357
  • [documentation][system] Add more communication outlets on README #5359
  • [enhancement][feature][system] CommNet dynamic register memory #5281
  • [enhancement][system] Use symbol device #5341
  • [enhancement][system] fix multithread bug in env #5283
  • [bug][system][api] fix bug in cfg_replacement #5335
  • [bug][system] Fix create log directory thread-unsafe #5326
  • [bug][system] fix_bug_in_make_parallel #5328
  • [enhancement][system][cfg] replace train_conf, job_conf using cfg::xx #5263
  • [enhancement][system][quantization] support tensorrt in qat #5287
  • [enhancement][system][api] Export functional apis for oneflow.experimental. #5313
  • [enhancement][system] fix bug check between cfg enum and proto enum #5285
  • [enhancement][system] replace CHECK_EQ using CHECK_EQ_OR_RETURN #5279
  • [enhancement][system] Refactor SbpXXX to cfg::SbpXXX #5120
  • [enhancement][system][api] add detach for LazyMirroredtensorImpl #5270
  • [enhancement][system] shorten XXIsDynamic4ArgNameAndIndex to be xxIsDynamic #5265
  • [enhancement][system][cfg] job_config to cfg #5235
  • [feature][system] Multi-Client LogicalRun degenerate to PhysicalRun #5479
  • [enhancement][system] fix ConstructOp without JUST #5480
  • [enhancement][system] Output arg modifier return maybe part 1 #5451
  • [feature][system][interface] Fea/nn graph/graph build ctx #5420
  • [enhancement][system] Throw exception if check failed #5457
  • [feature][system] multi client launch #5372
  • [enhancement][system][api] Optimize reduce mean #5452
  • [enhancement][system] export Tensor only to python #5440
  • [enhancement][system] Output arg modifier return maybe part_0 #5447
  • [enhancement][system] ThreadMgr support AddPlan #5450
  • [enhancement][system] Refactor infer ctx input tensordesc #5226
  • [enhancement][system][api] instruction builder return maybe #5442
  • [feature][system][interface] MultiClientSessionContext #5421
  • [enhancement][feature][system] add launcher, update multi client launch and exit #5414
  • [purge][system][refactor] Remove IOConf #5419
  • [enhancement][system] Dev refine generator #5426
  • [enhancement][system] Support inplace operations #5204
  • [enhancement][system][refactor] Dev refactor generator #5397
  • [enhancement][system] Add new placement init func #5408
  • [enhancement][system] NNGraphIf #5387
  • [enhancement][system][refactor] Cast explicitly in unpack call to avoid conflict with Optional. #5380
  • [enhancement][system][interface] [Random Generator] Part2: Migrate functional dropout #5378
  • [enhancement][system] replace ForeignJobInstance using JobInstance #5374
  • [enhancement][system][refactor] Speedup reshape module by 5x. #5381
  • [feature][system][interface] [Random Generator] Part1: Dev random generator #5360
  • [enhancement][system] Add ONEFLOW_STREAM_CUDA_EVENT_FLAG_BLOCKING_SYNC #5612
  • [enhancement][system] [part2]Remove singleclient outdated api #5568
  • [feature][system][interface] nn.Graph call and launch impl #5580
  • [enhancement][system] remove outdated doctest api and "@experimental_api" #5564
  • [feature][system][interface] Register ForeignCallback and Watcher in Multi-Client #5591
  • [enhancement][system] [Part-1]remove outdated api and files of multi-client on master branch #5556
  • [feature][system][interface] LazyInterpret build LocalTensor if input is local #5582
  • [enhancement][system] add job_pass MultiClientAutoSourceAndSinkTick #5507
  • [feature][system] Fea/nn graph/optimizer #5533
  • [feature][system][interface] New/CloseRuntimeBuffers and RunLazyJob impl #5571
  • [feature][system][refactor][interface] NNGraph interface and implement for CompileAndRuntime #5558
  • [feature][system] Fea/nn graph/forward graph #5516
  • [enhancement][system] Lazy job stream type #5389
  • [enhancement][system] Refactor single client autotick #5506
  • [enhancement][system] replace underline using dot in single client #5547
  • [bug][system] fix return type #5548
  • [feature][system][interface] LazyInterpret for UserOpExpr #5544
  • [enhancement][system] Add ProfilerStart/ProfilerStop API #5542
  • [feature][system][interface] LazyInterpreter for FetchOutputOpExpr and set op parallel_distribution #5527
  • [enhancement][system] Multi client push pull #5492
  • [enhancement][system] registry_callback_fn return maybe #5456
  • [enhancement][system] bw_gen_fn return maybe #5455
  • [enhancement][system] gen_bw_fn return maybe #5454
  • [enhancement][system] Compatible single client #5417
  • [feature][system][interface] GlobalMultiClientEnv and refine EagerExecution #5523
  • [enhancement][system] Job pass maybe system #5503
  • [enhancement][system] Remove Plan::net_topo #5502
  • [feature][system][interface] LazyInterpret for FeedVariableOpExpr #5490
  • [enhancement][system] Input arg modifier return maybe #5453
  • [feature][system][interface] Fea/nn graph/block scope #5498
  • [feature][system] jit_fuse_cast_scale #5332
  • [enhancement][system] Remove obsolete Profiler #5747
  • [enhancement][system][api] Dev fix batch norm not stats #5733
  • [enhancement][system] rename rpc_token to TransportToken #5735
  • [enhancement][system][api] Refactor maximum minimum py2cpp #5724
  • [enhancement][system] Replace piece_id with comm_net_sequence_number #5731
  • [enhancement][system] beautify stack frame #5686
  • [enhancement][system] Add env ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER #5728
  • [enhancement][system] Add env ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE #5720
  • [enhancement][system][api][refactor] Refactor functional sub, mul and div apis #5713
  • [feature][system] ddp #5008
  • [enhancement][system][api][refactor] Refactor functional matmul and add apis. #5697
  • [bug][system] Fix ClearKV("plan") #5710
  • [enhancement][system] Rename cpu to async cpu #5712
  • [enhancement][system] Support tensor.to()/to_local() #5271
  • [feature][system][refactor][interface] Multi-Runtime for multi nn.Graph #5683
  • [bug][system][refactor] Add tag for Optional inplace constructor #5619
  • [enhancement][system] Move Global to env scope #5670
  • [enhancement][system] add JUST wrapper #5681
  • [enhancement][system] New sync consistent meta info #5634
  • [enhancement][system][refactor][interface] Refactor RuntimeCtx for multi-runtime #5664
  • [feature][system][interface] Feat: memory shared between EagerTensor with VariableRegst #5649
  • [enhancement][system] Use functional call directly instead of construct a module and then call-Add #5613
  • [enhancement][system] disable eager_op consistent mode #5647
  • [enhancement][system] add msg_penddin_list in ibverbs_qp to optimize qp_init_attr.cap.max_send_wr #5485
  • [enhancement][system] IBVerbsCommNet add knobs #5626
  • [enhancement][system] Prune python tensor #5596
  • [feature][system][interface] Feat: LazyInterpret infer op / tensor ParallelDescScope #5625
  • [enhancement][system] Replace src tick with wait and send ids #5603
  • [enhancement][system] Support symbol placement type in functional. #5627
  • [enhancement][system][api][refactor][interface] Dev advanced indexing #5559
  • [enhancement][system] Optimize maybe. #5839
  • [enhancement][system] Decorator 4 disable recursive boxing call #5796
  • [enhancement][system] add_eager_boxing_and_op_interpreter_dispatch_error_info #5819
  • [enhancement][system] Kernel CUDA Graphs Support #5725
  • [bug][system] Fix placement print bug #5853
  • [bug][system] when error msg formatting fails, return error->DebugString #5844
  • [enhancement][system][refactor] Rename variables named *parallel_distribution* to *nd_sbp* (1) #5815
  • [feature][system][interface] Support Free EagerTensor caught in nn.Graph build #5777
  • [enhancement][system] Reuse CUDA event / Refine BnInOp2Blob / Refine channel #5837
  • [enhancement][system][serving] fix bug in AddInputOutputOpsPass: check existence of key in HashMap(inferface_lbi2scope_sym_id) #5653
  • [enhancement][system][api] unpack_call: impl new unpack_call_dispatcher for better performance #5820
  • [feature][system] Feat consistent tensor python constructor #5812
  • [feature][system] Support 0shape tensor #5620
  • [documentation][system] fix launcher description #5770
  • [feature][system][interface] Multi-nn.Graph memory reuse by Chunk manager #5658
  • [bug][system] Fix naive b2p error #5806
  • [enhancement][system] set created generator with default rng seed #5801
  • [enhancement][system] enhance_local_to_consistent #5761
  • [feature][system] add flow.randn #5736
  • [enhancement][system] Refactor hierarchical parallel cast autograd #5764
  • [enhancement][system] Collective boxing executor add_plan delete_plan #5495
  • [enhancement][system] Fix throw abort #5795
  • [enhancement][system] DECORATE #5794
  • [enhancement][system] Interface eager boxing #5682
  • [enhancement][system] extract_consistent_to_consistent_op_expr #5870
  • [enhancement][system] disable backward pass consistent tensor meta check. #5871
  • [enhancement][system] Add CudaStreamIndexGenerator::GenerateNamedStreamIndex #5940
  • [bug][system] Only query PCI bus id when CUDA version >= 11 #5937
  • [enhancement][system] maybe: add JUST_MSG and CHECK_JUST_MSG #5904
  • [bug][system] Fix bug scalar #5950
  • [enhancement][system] framework: fix rvalue reference warnings #5948
  • [purge][system] Remove CudaWorkType #5942
  • [enhancement][system] refactor_symbol #5941
  • [bug][system] consistent_tensor_infer_cache: fix memory leak #5938
  • [feature][system] support to print gpu #5936
  • [enhancement][system] Bugfix static check #5935
  • [bug][system] fix nccl_version log #5934
  • [bug][system] Fix bug of multi-GPU train nn.Graph extra mem cost in rank 0 #5930
  • [enhancement][system] Only gradient acc be scheduled in parallel. #5926
  • [enhancement][bug][system] fix_ddp_bug_on_8_process #5929
  • [enhancement][system] Fix bug error msg format #5866
  • [feature][system] print consistent tensor data #5902
  • [bug][system] Move parse env to the constructor #5922
  • [enhancement][system] Remove GlobalWorkStreamId/GlobalThrdId #5917
  • [bug][system] shared_or_scalar: fix alias warnings #5916
  • [purge][system] Remove CompActor #5919
  • [enhancement][system] Use symbol dtype #5641
  • [enhancement][feature][system] Control Graph / Session / Env's python c++ object destruction #5845
  • [enhancement][bug][system] Sync access and assign indexing tensor. #5907
  • [enhancement][system][api][refactor] Dev consistent arange #5883
  • [enhancement][system] Lazy interpreter for new ConsistentToConsistentOpExpr #5903
  • [bug][system] Fix BUG of LazyInterpret FreeEagerTensor memory shared with regst #5891
  • [bug][system] fix typo in raise RuntimeError #5890
  • [enhancement][system][refactor] Rename the ParallelDistribution class to NdSbp #5814
  • [feature][system] add flow.rand #5722
  • [feature][system] Lazy Interpret support infer default device cpu #5880
  • [enhancement][system] Tensor str #5783
  • [feature][system][interface] Lazy to_consistent #5774
  • [enhancement][system] wait vm empty before exiting #5860
  • [enhancement][system] Eager boxing n to 1 #5949
  • [enhancement][system] add kernel observer #6052
  • [enhancement][ci][system] Optimize ddp broadcast and add speed/memory test in ci #6044
  • [enhancement][system] add var to control only print warning once when blocked #6045
  • [enhancement][system][refactor] Rewrite pow and logical functional apis #6032
  • [enhancement][system] Token seq id #5964
  • [enhancement][documentation][system] Remove python function wrapper. #6012
  • [feature][system] Add timeout and loc for blocking calls #6007
  • [enhancement][system] Eager boxing 1 to n #5943
  • [enhancement][system] Boxing expr #6015
  • [enhancement][system] new_X_to_B #5987
  • [enhancement][system] Add unimplemented return information #5952
  • [enhancement][system] Revert "Faster decorator" #6006
  • [enhancement][system] Throw exception if using advanced indexing for tensor setitem #6001
  • [enhancement][system] Support eager boxing sm 2 sn #5869
  • [enhancement][system] Move framework/local_dep_object.* to the eager directory #5988
  • [enhancement][system] Fix builtin op arg tuple. #5464
  • [feature][system][refactor] Dev functional multiple signatures #5982
  • [enhancement][system] Faster decorator #5996
  • [enhancement][system] Placed nd sbp #5995
  • [feature][system] Support asymmetric input/output/variable tensors in nn.Graph #5983
  • [enhancement][system] LightActor #5868
  • [bug][system] Prevent running oneflow in forked subprocess #5976
  • [bug][system] common/error: fix build error in mac os #5971
  • [bug][system] fix_bug_test_tensor_str #5958
  • [enhancement][system] Refine StreamContext #6191
  • [enhancement][system] container_util: fix VectorAt, remove useless MutMapAt #6172
  • [enhancement][system] Typesafe KernelState #6198
  • [enhancement][system] Primitive based copy task node #6195
  • [feature][system][interface] Lazy support Scalar #6181
  • [enhancement][system] Disable implicit boxing when parallel num eq one #6188
  • [enhancement][system] Primitive #6183
  • [enhancement][system] Remove IDMgr::GetGpuPhyIdFromThrdId/IDMgr::GetDeviceTypeFromThrdId #6169
  • [enhancement][system] remove op_expr_helper inside gradient_funcs #6057
  • [feature][system][api] Add tensor yaml, support export tensor functional api. #6099
  • [feature][system] Plan memory log #6151
  • [feature][system] Add dtype bfloat16 #5304
  • [enhancement][system] StreamContext #6129
  • [bug][system] Fix wrong inplace acc grad #6146
  • [enhancement][system] UserKernel remove job_desc #6144
  • [enhancement][system][api] Fea/graph/add outputs buffer to enable pipeline #6126
  • [enhancement][system] not fuse request for nccl 2.10.3 #6136
  • [bug][system] NewUniqueId thread safe #6141
  • [enhancement][system] XRT remove job_desc #6139
  • [enhancement][system] SystemOpFillJobNamePass #6138
  • [enhancement][system] mv_boxing_folder_to_core #6140
  • [enhancement][system] Refactor boxing interpreter to boxing expr #6134
  • [enhancement][system] Eager boxing one to one #6048
  • [enhancement][system] Vm cpu efficiency #6110
  • [enhancement][system] Naive generic boxing #6116
  • [feature][system] send/recv #5992
  • [enhancement][system] disable_print_stack_in_tensor_numpy #6123
  • [feature][system] add all_reduce by to_consistent #5963
  • [enhancement][system] KernelContext #6084
  • [enhancement][bug][system] Fix sync nccl and async nccl deadlock #6071
  • [bug][system][refactor] Refactor to local #6098
  • [enhancement][system] Replace xor with hash combine (part 1) #6078
  • [enhancement][system] Optimize error message #6073
  • [enhancement][system] Rename Error::xx to Error::xxError #6049
  • [enhancement][system] send formatted msg to glog #5999
  • [feature][bottleneck][bug][system][interface] [Feat.] NNGraph new eager tensor for new variable created in JobPass #6091
  • [bug][system] Fix bug of multi-GPU eager copy D2H extra mem cost in rank 0 #6092
  • [enhancement][system][api] Rename module flow.F to flow._C #6053
  • [feature][system][interface] [Feat.] Eager consistent OFRecordReader #6089
  • [enhancement][system][api] Dev fix and align interface #6075
  • [feature][bottleneck][bug][system][interface] NNGraph input/output valid by register tensors #6240
  • [bug][system][interface] Fix bug of Multi-Client src tick output order #6221
  • [enhancement][bug][system] Add cast primitive #6234
  • [feature][bottleneck][system][interface] Auto FixPipelineStageIdPass #6204
  • [enhancement][system] move scalar to oneflow namespace. #6235
  • [enhancement][system] UserKernel init CUDA Graphs with state #6230
  • [feature][system] Comm broadcast #6213
  • [enhancement][system][refactor] Rename opname to optype_name in AutogradEngine #6154
  • [enhancement][system] Add memset primitive #6218
  • [enhancement][system] Add StreamContext::device_type()/DeviceCtx::device_type() #6217
  • [feature][system] add all_gather and fix bug of multi rank doctest #6189
  • [feature][system][interface] [Feat.] Lazy interpreter skip hierarchical_parallel_cast #6208
  • [purge][system] Cleanup KernelUtil #6212
  • [enhancement][system] StreamContextAdapter #6205
  • [enhancement][system] Dev eliminate gcc warnings #6199
  • [feature][bottleneck][system][interface] [Feat.] nn.Graph support grad acc with input/output tensor #6155
  • [enhancement][system] Cpu symmetric s to s #6153
  • [enhancement][system][upload-core] Op expr infer tensor meta #5064
  • [enhancement][system] Infer consistent tensor meta #5362

CI enhancements:

  • [bug][ci][api][interface] Refine module test #5232
  • [enhancement][ci] Add Simple CI, runs CPU-only on GitHub hosted servers #5207
  • [enhancement][ci] Run exe test in CPU-only #5202
  • [enhancement][ci] Cancel all workflow runs but the latest #5206
  • [enhancement][ci] Fix master not running Simple CI #5368
  • [enhancement][ci] Refine Simple CI and Clang analysis #5367
  • [enhancement][feature][bug][ci][documentation][interface] Fix upsample bilinear bug #5363
  • [enhancement][ci] Build nightly for py39 #5318
  • [enhancement][ci] Try distributed run for 3 times to prevent failure #5305
  • [enhancement][ci] Upload Simple CI logs to cloud #5268
  • [enhancement][ci] Remove cpu_op_eager and cuda_op_eager #5470
  • [bug][ci] fix segfault in clang plugin #5437
  • [enhancement][ci] Refine Simple CI error output #5435
  • [enhancement][ci] Add conda env to Simple CI #5385
  • [enhancement][ci] Fix clang plugin core file not found #5390
  • [bug][ci] upload core when build with clang plugin #5384
  • [bug][ci] clang plugin skip more files #5373
  • [enhancement][ci] Use gh-action-scheduler-v2 #5370
  • [enhancement][ci] relax speed threshold #5569
  • [bug][ci] Fix wrong test path under compatible #5567
  • [enhancement][ci][need-simple-ci] Prevent upload logs automatically #5560
  • [enhancement][ci][interface] Add nn.AdaptiveAvgPool1d and nn.AdaptiveAvgPool3d #5445
  • [feature][ci] add speed test in ci #5496
  • [enhancement][ci] Reduce usage of Simple CI #5546
  • [feature][bug][ci][api] Restruct upsample module #5524
  • [feature][ci] multi client launcher test #5488
  • [enhancement][ci] Remove automerge if cuda_new_interface failed #5519
  • [enhancement][ci] Prevent adding subdir in python/test #5514
  • [enhancement][ci] piprepo->pipindex #5517
  • [enhancement][ci] add dynamic_loss_scale in ci tests #5337
  • [enhancement][ci] Add timeout for wait_gpu_slot #5497
  • [enhancement][feature][ci] new static check based on clang-tidy #5476
  • [enhancement][ci] Fix url not downloadable in some browsers #5701
  • [feature][ci] multi client multi machine test #5685
  • [enhancement][ci] Add cpu new interface CI #5639
  • [enhancement][ci][need-simple-ci] Mv clangtidy to simple ci #5667
  • [enhancement][ci][need-simple-ci] use clang tidy appimage in ci #5841
  • [enhancement][ci] Use gcc 7 in release to prevent error #5840
  • [enhancement][ci] bn tol 1e-4 => 1e-3 #5811
  • [enhancement][ci] fix distributed run on built dir #5810
  • [enhancement][ci] fix third party mirror check_sum #5802
  • [ci][documentation] find more accurately which files need to be doctested #5782
  • [enhancement][ci] Print stack unconditionally #5779
  • [enhancement][ci][need-simple-ci] Enable more checkers for clang-tidy in CI #5738
  • [enhancement][ci] CI: add clang-tidy check to test.yaml #5920
  • [ci][documentation] fix docstring in oneflow.nn.functional namespace #5807
  • [enhancement][ci] disable TREAT_WARNINGS_AS_ERRORS in Release CI #5886
  • [enhancement][ci] Skip ci jobs by git diff #5863
  • [bug][ci] quick fix #5978 #6030
  • [enhancement][bug][ci] fix clang tidy diff options and file format #5990
  • [enhancement][ci] add flow.relu #5847
  • [enhancement][ci] equal => allclose #6164
  • [bug][ci][need-simple-ci] CI: fix clang tidy checks in simple ci #6161
  • [enhancement][bug][ci][documentation][api] add interpolate and layer_norm docs #6157
  • [bug][ci] update speed test #6113
  • [enhancement][bug][ci][documentation][api] speed import oneflow #6107
  • [bug][ci] Also try install dev deps for CODEGEN_PYTHON_EXECUTABLE #6115
  • [bug][ci][need-simple-ci] set gtest_CMAKE_DEBUG_POSTFIX "d" #6085
  • [enhancement][ci] add cache init file for clang and CI build with clang #6062
  • [enhancement][ci] add emoji in speed test output, make it continue-on-error #6214

Test enhancements:

  • [bug][test][interface] Fix acos ci bug #5217
  • [feature][test] implement automated test #5321
  • [enhancement][test] move generator test into ops folder to accelerate tests #5472
  • [feature][test][api] Add autotest part2 #5467
  • [enhancement][test][api][interface] Add some tests with the new framework for auto testing #5561
  • [bug][test] fix test error when do multi case test on graph #5590
  • [enhancement][test] Refine module test using auto test by yaochi #5484
  • [enhancement][test] Add autotest for BatchNorm2d #5734
  • [enhancement][test] RTH_update_op_test #5823
  • [enhancement][test] dev adamw graph config #5745
  • [feature][test][api][interface] Add new autotest #5562
  • [bug][test] restore test of alexnet graph #5798
  • [enhancement][test][interface] add zhangshen op-test #5600
  • [feature][bug][tooling][test][interface] Record autotest wrong code #5923
  • [enhancement][feature][test][api] add randint #5718
  • [bug][test] fix multi machine test #5984
  • [enhancement][test][interface] some op test #6095

Tooling enhancements:

  • [bug][tooling] user/summary: fix memory leak in FillImageInSummary #5742
  • [enhancement][tooling][cfg] cfg: add move assignment operator for performance #5962
  • [enhancement][tooling][api][refactor] refactor_all_device_placement_api #6080
oneflow - v0.5rc1

Published by jackalcooper about 3 years ago

Changelog

v0.5rc1 (13/09/2021)

Highlights

  • First-class support for eager execution. The deprecated APIs have been moved to oneflow.compatible.single_client.
  • Drop-in replacement of import torch for existing PyTorch projects. You can test it by interchanging the imports, e.g. import oneflow as torch (or, going the other way, import torch as flow); a minimal sketch follows this list.
  • nn.Module for eager execution
  • nn.Graph for lazy execution
  • DDP for data-parallel training (see the sketch after the nn.Graph example below)
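
For instance, a minimal sketch of the drop-in swap (a hypothetical snippet, assuming the script only touches APIs that OneFlow already mirrors, such as randn and nn.Linear):

import oneflow as torch  # the only changed line; the rest stays PyTorch-style

x = torch.randn(2, 3)
m = torch.nn.Linear(3, 4)
y = m(x)  # runs eagerly on OneFlow with the same call syntax as PyTorch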

A sneak peek of the new API

Here is a minimal example showing how to incorporate an nn.Module into an nn.Graph and run it in lazy mode.

import oneflow as flow

class NeuralGraph(flow.nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model  # model is an nn.Module instance

    def build(self, x):
        y_pred = self.model(x)
        return y_pred

model = flow.nn.Linear(3, 4)  # any nn.Module works here
graph = NeuralGraph(model)    # create an nn.Graph instance
x = flow.randn(2, 3)
y_pred = graph(x)             # running the graph traces build() and executes it lazily
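
For the DDP item above, a minimal, hedged sketch (it assumes the torch-style DistributedDataParallel wrapper and the oneflow.distributed.launch entry point added in this release, see "ddp #5008" below; names may differ in your version):

import oneflow as flow
from oneflow.nn.parallel import DistributedDataParallel as ddp

model = flow.nn.Linear(3, 4).to("cuda")
model = ddp(model)  # gradients are all-reduced across ranks during backward

# launched on a single node with, e.g.:
#   python3 -m oneflow.distributed.launch --nproc_per_node 2 train.py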

New in Python API:

  • [feature][eager][op][test][python][interface] Add test for convtranspose2d #5239
  • [enhancement][python][interface] Add GroupNorm #5175
  • [enhancement][eager][python][interface] [Add] avgpool1d avgpool3d #5165
  • [feature][eager][op][python][interface] Add deconv cpu impl #5224
  • [bug][eager][api][python][interface] Fix acosh bug #5221
  • [feature][eager][op][python][interface] Dev modules ctc loss #5168
  • [bottleneck][bug][documentation][python][interface] Fix meshgrid test bug #5208
  • [eager][documentation][python][interface] Rename CosineScheduler to CosineAnnealingLR #5112
  • [feature][eager][python][interface] Add meshgrid module #5205
  • [enhancement][feature][bug][op][python] support bias in conv2d's parameter list #5322
  • [eager][documentation][api][python][interface] add not_equal, greater_equal and less_equal module #5350
  • [enhancement][eager][python] refine pow module and its test #5319
  • [enhancement][eager][op][python] Add triu op #5329
  • [enhancement][bug][python] Fix optimizer for not supporting all kinds of iterables #5355
  • [bug][python][interface] raise IndexError in get_canonical_index to support for loop #5345
  • [bug][python][interface] tensor slice assign supports broadcasting #5344
  • [enhancement][op][python] add cpu group conv logic #5314
  • [enhancement][python] Add 'nn.Mish' module and corresponding functions #5310
  • [enhancement][build][python] Remove ONNX from setup py #5297
  • [enhancement][python][interface] [add] zeropad2d #5278
  • [feature][system][python][interface] Lazy nn.Graph FeedInputOpExpr #5458
  • [feature][python][interface] integrate nn.image.flip #5411
  • [bug][python] Fix issues in point of MultiClientSession #5469
  • [enhancement][bug][python] update HasAllMultiClientEnvVars() #5459
  • [enhancement][python] Add in_top_k function #5428
  • [enhancement][python] Dev add docstring #5449
  • [feature][api][python] MultiClientSession #5407
  • [documentation][python] remove --user #5431
  • [feature][python][interface] nn.Graph python #5309
  • [feature][python][interface] Fea/nn graph/graph name #5413
  • [bug][python][interface] rm nn.Graph.train #5424
  • [op][documentation][api][python][interface] add bernoulli module #5353
  • [enhancement][python] flow.S/B/P #5306
  • [enhancement][documentation][python] Add instruction on upgrade pip #5400
  • [enhancement][python] Rm oneflow export and experimental #5589
  • [bug][python] Fix nn.graph.utils module conflict #5598
  • [feature][ci][python] Update autotest framework #5520
  • [enhancement][python] copy of_proto_python_dir to compatible_single_client_python #5539
  • [enhancement][api][python] del default env init #5537
  • [enhancement][python] Fix single client using same glog file #5535
  • [bug][api][python] Fix Session TryClose #5531
  • [enhancement][feature][python] split vector-matrix norm #5478
  • [feature][eager][op][python][interface] Add more upsample kernel #5382
  • [enhancement][feature][test][python] add torchstyle unittest #5489
  • [feature][system][python] nn.Graph with training #5662
  • [enhancement][feature][python] Fea/nn graph/block proxy func #5727
  • [enhancement][api][python] consistent_tensor_to_api #5703
  • [feature][eager][op][python] Dev Align torch avgpool #5610
  • [enhancement][python] fix circular deps of sbp python module #5706
  • [documentation][python] [part5]Remove singleclient outdated api #5674
  • [enhancement][python] [part4]Remove singleclient outdated api #5672
  • [bug][op][python] remove outdated code in conv3d #5696
  • [enhancement][test][python] enlarge tolerance of dataloader test #5689
  • [enhancement][test][python] add autotest for some math ops #5646
  • [feature][python] nn.Graph optimizer part 2: add L2, pass job complete, refactor #5604
  • [enhancement][python] Add clip_grad_norm #5299
  • [purge][python] Remove Single-Client API in oneflow default python #5827
  • [bug][python] Fix ddp grad size #5834
  • [enhancement][feature][python] Dev RMSprop graph conf #5768
  • [enhancement][purge][eager][python] remove scale arg in optimizer #5821
  • [enhancement][feature][python] graph/block io check #5803
  • [enhancement][feature][python] Dev adam graph conf #5709
  • [purge][python] [part10]Remove singleclient outdated api #5756
  • [feature][api][python] better repr of nn.Graph for debug #5762
  • [bug][python] fix weight decay in RMSprop #5755
  • [purge][python] [part9]Remove singleclient outdated api #5752
  • [purge][python] [part8]Remove singleclient outdated api #5750
  • [documentation][python] add first batch of methods in oneflow.nn.functional namespace #5693
  • [purge][python] [part6]Remove singleclient outdated api #5704
  • [bug][python] use default_generator.seed() as random_seed in init #5721
  • [bug][system][python] ddp broadcast params and buffers #5913
  • [enhancement][test][python] Add consistent tensor requires grad test #5925
  • [bug][python] wrap flow.nn.init.* with flow.no_grad() #5932
  • [feature][api][python][interface] add clip_grad to optimizer #5817
  • [enhancement][ci][op][test][python] add randperm with test and docs #5680
  • [feature][api][python] Fea/nn graph/ lr_schedule(and cosine lr_sch) and opt_group #5846
  • [bug][python] fix bug of SyncOnMasterFn atexit #5909
  • [purge][python] Delete single client nn modules #6061
  • [enhancement][python] Move framework.distribute to env #6022
  • [bug][python] skip sync when abnormally exiting #6025
  • [feature][python] Fea/nn graph/warmup amp config #5969
  • [documentation][python] add optimizer api docs #6131
  • [documentation][python] add_tensor_api_doc #6127
  • [bug][python] Fix test_grid_sample.py and test_affine_grid.py threshold #6125
  • [documentation][api][python] add doc of graph #6093
  • [bug][python] Fix make of_format fail in ubuntu #6120
  • [feature][api][python][interface] Fea/graph helpers #6088
  • [enhancement][eager][python][interface] Use flow.randint in dataloader #6086
  • [feature][eager][api][python][interface] Import oneflow as torch #6076
  • [enhancement][test][api][python][refactor] rename OfrecordReader to OFRecordReader #6090
  • [purge][python][need-single-client-tests] Delete single client nn modules #6082
  • [enhancement][python] flow.load tolerates FileNotFound fault #6083
  • [feature][python] Fea/pipeline in graph #6105
  • [enhancement][test][python] graph activation checkpointing #6192
  • [enhancement][feature][op][python] rnn test #6165

New in Ops:

  • [enhancement][op][api][refactor] [Functional] Part2: Add partial unary and math functional apis #5218
  • [enhancement][bug][op][interface] Refine deconv kernel #5229
  • [enhancement][op][api][interface] add ReflectionPad2d #5172
  • [feature][eager][op][api][interface] crossentropyloss and nllloss support ignore_index #5195
  • [feature][eager][op][api][interface] Yejiaojiao/dev bcewithlogitsloss #5173
  • [bug][ci][op] Dev user op set default is_dynamic #5223
  • [enhancement][op] add magic method for pow #5199
  • [enhancement][op][interface] add cpu version of upsampling #5194
  • [enhancement][bug][op][api][interface] add ReplicationPad2d #5148
  • [feature][eager][op][api][interface] add kldivloss module #5155
  • [feature][eager][op][documentation][build][api][interface] Add floor module and the corresponding testcases #4964
  • [enhancement][feature][op] Dev conv1d module #5280
  • [enhancement][op] Add ctc_greedy_decoder op #5294
  • [enhancement][op][system] Dev remove default grad func #5320
  • [enhancement][op][system] Add pad grad func. #5354
  • [enhancement][op][system] Add gradient funcs. #5348
  • [feature][purge][bug][eager][op][interface] fix upsample nearest bug #5347
  • [enhancement][op][system] [Functional] Part7: Migrate pooling ops #5253
  • [enhancement][op] nvjpeg hardware acc #5240
  • [enhancement][feature][ci][eager][op][api][interface] Add bmm module #5334
  • [enhancement][eager][op] Dev image decode eager #5333
  • [enhancement][op] Optimize softmax warp impl #4977
  • [enhancement][eager][op] Dev tensor buffer eager #5317
  • [enhancement][op][api][refactor] [Functional] Part6: Migrate conv op #5252
  • [enhancement][eager][op] Dev sort eager #5284
  • [enhancement][bug][op][api] fix bceloss bug in default weight and reduction #5303
  • [bug][eager][op] remove redundant assert and check #5264
  • [enhancement][bug][ci][op] fix bceloss bug about weight #5269
  • [enhancement][op][api][refactor] [Functional] Part5: Migrate nn ops #5249
  • [enhancement][eager][op] Dev argsort eager #5273
  • [enhancement][op][api][refactor] [Functional] Part4: Migrate array ops #5247
  • [enhancement][op][api][refactor] [Functional] Part3: Migrate binary and activation ops #5246
  • [bug][ci][op][test] Dev fix rmsprop ci fail #5481
  • [enhancement][op] add inplace method: Tensor.sin_ #5471
  • [bug][op] hotfix image_batch_align #5461
  • [enhancement][eager][op][interface] Dev maxpool series op 123d #5244
  • [bug][op] fix pool gpu kernel #5446
  • [feature][eager][op][api][interface] add pixelshufflev2 module #5383
  • [enhancement][feature][ci][eager][op][documentation][api][interface] Add flow xxx and tensor xxx autotest #5386
  • [enhancement][feature][eager][op][api][interface] Modules chunk #5324
  • [enhancement][eager][op] add image normalize for eager #5402
  • [enhancement][eager][op] Dev batch align module #5401
  • [enhancement][eager][op] add coco reader module #5391
  • [enhancement][wip][op] Restruct Elementwise kernel #4130
  • [bug][op] Fix DecodeRandom reuse mem #5606
  • [enhancement][op] Align pytorch maxpool #5525
  • [enhancement][bottleneck][eager][op][api] implementation of constantpad-3d op #5529
  • [enhancement][eager][op] Add scale size for resize #5509
  • [enhancement][op][api][refactor] Dev optimize tensor setitem #5501
  • [enhancement][op] register uint8 dtype to support dataloader #5499
  • [enhancement][op] Add unique.cuh #5487
  • [enhancement][op][api][interface] Dev ofrecord auto truncating #5412
  • [feature][op][system][interface] Feat: LazyInterpret::ApplyImpl support SourceUserOpExpr and Copy #5711
  • [enhancement][op][interface] Dev logical_and/or modules #5636
  • [enhancement][op] support any number positional arguments for ones and zeros op #5698
  • [enhancement][feature][eager][op] Add conv3d Module #5327
  • [feature][eager][op][api][interface] add batchnorm3d module #5631
  • [bug][eager][op] fix reduce min max backward bug #5651
  • [enhancement][op] Debug dim scatter #5371
  • [enhancement][op][interface] Dev eye #5583
  • [enhancement][eager][op] Dev minimum maximum #5576
  • [enhancement][op] Restruct activation grad op #5669
  • [enhancement][feature][eager][op] Rewrite activation function #5465
  • [bug][op][documentation] add oneflow.cat for documentation #5621
  • [enhancement][op] Lcy logsoftmax #5746
  • [feature][op][need-simple-ci] Feat empty op #5659
  • [enhancement][eager][op] Dev split #5714
  • [enhancement][op][interface] add index_select op #5661
  • [bug][op] fix nvjpeg hw acc #5851
  • [enhancement][op] Remove move in conv_cudnn #5828
  • [enhancement][op][interface] Dev logical_xor module #5694
  • [bug][eager][op] fix squeeze #5808
  • [enhancement][op] Get parallel_id and parallel_num through rank and world size in DDP #5717
  • [bug][eager][op] delete interpolate int type #5805
  • [bug][op] Fix bug in scatter #5743
  • [enhancement][op] Refactor: remove module not required, call function directly #5754
  • [enhancement][op] Remove modules not required(tan, erfc, log1p, scatter_nd) #5791
  • [enhancement][op] Refactor scatter, clamp and pow in cpp instead of in python #5715
  • [enhancement][op] Rm useless code in gather files #5687
  • [enhancement][eager][op] change flip_code to scalar #5786
  • [enhancement][bug][op][interface] fix upsample bug #5753
  • [bug][op][interface] Quick fix Lazy nn.Graph input/output OpConf.BlobConf.is_dynamic #5767
  • [enhancement][bug][eager][op] fix argwhere 0-dim bug #5760
  • [enhancement][eager][op] delete unused code #5744
  • [feature][op] Export fused_scale_tril op #5933
  • [bug][op] Fix backward bug in 3d #5908
  • [bug][op] Fix one_hot api limit #5927
  • [enhancement][eager][op] Dev where scalar #5797
  • [bug][op] fix grad error #5914
  • [feature][bug][op] Fix inplace op circle reference bug #5910
  • [enhancement][op] Move the judgment content to c++, And add scalar fmod #5854
  • [enhancement][op] Support combined_margin_loss op in flow.nn.modules #5830
  • [enhancement][op][api][interface] functional_one_hot #5315
  • [enhancement][op] Dev scalar op #5778
  • [bug][eager][op] fix gather kernel 0 shape #5888
  • [enhancement][op] add l2_normalize for mutl-client interfaces #5859
  • [feature][op] Export function softmax_cross_entropy #6056
  • [enhancement][op] Add int attr for functional adaptive average pool #6059
  • [enhancement][op][interface] dev full op #5955
  • [bug][eager][op] fix 0dim inplace add #6029
  • [feature][op][system][interface] Feat: nn.Graph image gpu decoder #6014
  • [enhancement][op][interface] dev optim_optim_lr_scheduler_multisteplr #5975
  • [enhancement][op] NopKernel #6035
  • [enhancement][eager][op][api] Dev tril op #6005
  • [enhancement][op] dev unfold and fold #5675
  • [enhancement][op] ResNet CUDA Graphs #6018
  • [enhancement][feature][op] add broadcast pow #6013
  • [enhancement][op][interface] init of op diag #5298
  • [op][documentation][api] Fix api document bug #6009
  • [enhancement][op] Dev fused functional #5954
  • [bug][op][build] Add nvcc flag -Werror cross-execution-space-call #6002
  • [bug][op] Fix Normalization grad function #5993
  • [enhancement][feature][eager][op][test][interface] Add fused self attention #5966
  • [enhancement][bug][ci][eager][op][api][interface] Try to fix var bug #5973
  • [enhancement][feature][eager][op][interface] add prod op #5867
  • [enhancement][eager][op][api] add glu op #6065
  • [enhancement][op] Align Torch.nn.functional poolXd #6184
  • [bug][eager][op] fix backward index for gamma beta #6149
  • [bug][op][system] Fix BroadcastMatmulGrad bug #6168
  • [enhancement][op][api] Add Int support for functional.avg/maxpool #6174
  • [bug][eager][op][api][interface] align dropout api name with pytorch #6170
  • [enhancement][op] support inplace operation for hardsigmoid #6137
  • [enhancement][bug][op] Fix do bias correction in Adam/AdamW #5960
  • [bug][eager][op][api][interface] fix repeat 0-dim tensor bug #6150
  • [enhancement][bug][op] Fix select_first_grad bug #6142
  • [bug][ci][eager][op][documentation][interface] Add clipgrad doc and contiguous #6130
  • [bug][op] Fix eager optim dynamic attr bug #6111
  • [enhancement][op] Support grid_sample and affine_grid operator #6038
  • [op][documentation] Export apis for documentation #6068
  • [enhancement][feature][bug][ci][eager][op][documentation][interface] transfer python function to c++ method #6114
  • [op][documentation] Dev functional batch_gather #6233
  • [enhancement][op][test] fix cross_entropy_loss and its test #5799
  • [bug][op] Use attr nd_sbp to check consistent #6222
  • [enhancement][op] Dev fused bn functional #6077
  • [enhancement][op] support default value in intlist #6201
  • [bug][op] fix sparse_softmax get_nd_sbp #6203
  • [bug][op] Fix bug in model fused update #6197
  • [enhancement][op][system][refactor] Optimize tensor getitem. #5433

New in Eager:

  • [enhancement][eager][interface] Reconstruct module files #5251
  • [bug][eager][documentation][interface] Fix conv module bug #5245
  • [bug][ci][eager][interface] Fix bce withlogitloss ci error #5237
  • [feature][eager][api][interface] module BCELoss #5144
  • [enhancement][feature][eager][api][interface] Dev norm op #5178
  • [enhancement][bug][eager] Fix stack module #5222
  • [enhancement][feature][eager] Support different dtype of equal module #5214
  • [enhancement][bug][eager][documentation][api][interface] Add nllloss backward #5210
  • [enhancement][eager][api][upload-core] Decouple FileSystem and IOConf #5162
  • [enhancement][ci][eager] Set lower precision avoid ci failing #5200
  • [eager][documentation] Add hint when apply FunctionNode second time #5369
  • [enhancement][feature][bug][ci][eager][documentation][api] Fix upsample bilinear bug #5366
  • [bug][eager] Fix not contiguous ndarray to tensor bug #5351
  • [enhancement][eager][system] Infer consistent tensor meta #5118
  • [feature][eager] Feat graph autograd engine #5296
  • [enhancement][eager][interface] Dev type as module #5349
  • [feature][eager][documentation][api][interface] Add new ones module #5342
  • [enhancement][bug][eager] Fix logical slice assign dtype #5339
  • [bug][ci][eager][documentation][api][interface] Fix where module bug #5300
  • [bug][ci][eager][documentation][api] Fix l1loss ci error #5307
  • [enhancement][bug][eager][documentation][api][interface] Qi's First Edit of deleting "print" and ".numpy" #5129
  • [feature][eager][refactor] Separate autograd meta to tensor #5267
  • [feature][eager][api][interface] add tile module #5234
  • [enhancement][eager] Release lambda function to reuse tensor memory #5266
  • [feature][bug][eager][documentation] Fix default value not set bug #5483
  • [enhancement][eager][interface] [Add] gather_nd scatter_nd #5422
  • [enhancement][bug][eager] fix param #5473
  • [bug][eager] Fix Tensor.grad setter bug #5462
  • [enhancement][eager] Rename now_grad_arg to current_grad #5466
  • [eager][test][documentation][interface] Add autotest part1 #5436
  • [enhancement][eager] Use functional copy instead of op_builder #5460
  • [bottleneck][bug][eager][interface] fix -1 index not support bug #5448
  • [bug][ci][eager][documentation][api] Fix concat backward bug #5443
  • [enhancement][bug][ci][eager] Add autograd engine warning #5444
  • [feature][eager][api][interface] Smoothl1loss #5256
  • [enhancement][bottleneck][eager] remove device dtype params #5434
  • [bug][ci][eager][documentation][interface] Delete maxpool failed test #5409
  • [enhancement][eager][api] Add tensor grad assignment #5379
  • [enhancement][bug][eager] fix-abs #5398
  • [enhancement][bug][eager][interface] Fix bn track running stats #5393
  • [enhancement][bug][eager] Support uint dtype of constant op #5396
  • [enhancement][bug][eager][documentation][interface] Delete useless code upsample #5392
  • [enhancement][ci][eager][interface] add flow.view #5301
  • [enhancement][bug][ci][eager][api][interface] Add masked select module #5356
  • [bug][eager][interface] Fix batchnorm backward bug #5602
  • [enhancement][eager] Support weight_decay (L2 actually) #5587
  • [feature][eager][documentation][api] Add new autotest #5588
  • [enhancement][eager][documentation][api] Dev fmod #5404
  • [feature][eager] Support inplace add #5432
  • [feature][eager][interface] Feat tensor stride property #5543
  • [enhancement][feature][eager][documentation][api] Add flip module #5541
  • [feature][eager] Feat module repr #5486
  • [enhancement][bottleneck][bug][eager][interface] Fix maxpool1d params #5493
  • [enhancement][feature][eager][interface] Dev flow.utils.data part1 #5406
  • [bug][eager][api] Fix tensor getitem bug #5474
  • [enhancement][eager][need-simple-ci] export datasets interface #5691
  • [enhancement][eager][system] rebase #5601
  • [enhancement][eager][test] added nn.RecordBytesDecoder with its test #5475
  • [enhancement][feature][eager][need-simple-ci] 0-dim tensor support #5552
  • [enhancement][bug][eager] rewrite slice_update backward #5677
  • [enhancement][bug][eager][interface] align view input style with torch #5676
  • [enhancement][eager][interface][need-simple-ci] add autotests for modules #5666
  • [enhancement][bottleneck][eager][interface] Dev constantpad1d op #5579
  • [enhancement][eager][api][interface] Restruct MathOps AutoTest #5654
  • [enhancement][bug][ci][eager] Fix flip bug #5657
  • [bug][eager][api][interface] Fix expand module bug #5650
  • [enhancement][bug][eager][documentation][api] Fix repeat bug #5633
  • [enhancement][eager][test][api][interface] Add new autotest #5617
  • [enhancement][eager][api][interface] Dev flow.utils.data part2 #5500
  • [enhancement][bug][eager] make setitem device match #5835
  • [bug][eager][api][interface] align reshape input param with pytorch #5804
  • [feature][bug][eager][api] Align where op with torch #5850
  • [enhancement][bug][eager][api] Restruct prelu op #5829
  • [bug][eager][need-simple-ci] fix pooling ceil_mode bug #5818
  • [enhancement][eager] stateful local kernel supports consistent #5789
  • [bug][eager][api][interface] Fix argwhere bug #5816
  • [enhancement][eager][documentation][api] dev-nonzero #5809
  • [enhancement][feature][eager][api] Add fake quantize op #5690
  • [enhancement][bug][eager][documentation][api] Add api #5663
  • [enhancement][eager] Refactor consistent infer result #5790
  • [bug][eager][need-simple-ci] skip dataloader test #5780
  • [bug][eager][need-simple-ci] fix 0-dim tensor.fill_ #5771
  • [enhancement][eager] Cpu mpi broadcast #5726
  • [feature][eager] Feat grad mode classes #5956
  • [enhancement][bug][eager] fix wrong names #5951
  • [enhancement][eager][system] Local dep object pool #5953
  • [enhancement][eager][interface] rename OpExprInterpState to AutoGradCaptureState #5918
  • [bug][eager] Fix linear bug #5945
  • [bug][eager] Fix tensor_meta update bug #5924
  • [enhancement][eager] use flow.randperm #5928
  • [enhancement][eager] consistent init/save/load #5896
  • [enhancement][bug][eager][documentation][interface] Restruct sort and argsort op #5911
  • [enhancement][bug][eager][interface] Try to fix the problem that insightface cannot converge. #5906
  • [enhancement][bug][eager][interface] Add autotest #5899
  • [enhancement][eager] The scheduler thread joins worker threads #5893
  • [enhancement][eager] Bugfix async callback #5881
  • [feature][eager] Feat tensor to bool #5836
  • [bug][eager] Remove inplace broadcast_add #5551
  • [enhancement][eager] Broadcast consistent shape and dtype #5784
  • [enhancement][eager] Fix optimizer list parameters input bug #5848
  • [enhancement][eager][interface] Dev flow.utils.data part3 #5644
  • [enhancement][eager][api] Normalize naming of modules #6066
  • [enhancement][feature][eager][api][interface] add truncnormal #6051
  • [enhancement][bug][eager] AutoMatedTest support test module.parameter.grad #6043
  • [enhancement][feature][bug][eager] add module call kwargs #6069
  • [enhancement][eager][api][interface] add tensor.item tensor.tolist #6021
  • [enhancement][eager][api][interface] Export pool ops api #6047
  • [enhancement][bug][eager][test][documentation][interface] Add more autotest sample #6039
  • [enhancement][bug][eager][system] disable cuda_h2d stream #6020
  • [feature][eager][test][api][interface] Add autotest codegen #6019
  • [feature][eager][documentation] Refactor cosine lr scheduler #6000
  • [enhancement][eager][interface] tensor.cpu/tensor.cuda #5894
  • [enhancement][eager][api] Support consistent_tensor.to(dtype) #5991
  • [bug][eager][interface] remove redundant codes in ModuleDict #5961
  • [bug][eager] Fix LayerNorm check bug #6196
  • [enhancement][eager][api] Change dropout api #6182
  • [enhancement][good for pr][eager][api][interface] add: test convert dependency #6023
  • [enhancement][bug][eager][interface] Fix autotest codegen bug #6171
  • [bug][eager] restore instr_local_dep_object_pool_size for nccl #6160
  • [enhancement][eager][api][interface] Align pooling op functional api names with torch #6163
  • [feature][bug][eager][api][interface] delete file #6162
  • [bug][eager] Fix optim load_state_dict bug #6152
  • [enhancement][eager][api] add is_training to dropout functor #6148
  • [enhancement][eager] Decompose nd sbp boxing #5800
  • [enhancement][eager] support consistent_tensor.to(copy=True) #6122
  • [feature][eager] Static grad scaler #6135
  • [bug][eager] Fix LayerNorm expr bug #6121
  • [bug][eager][api] move numpy c api init in numpy.cpp, make np array contiguous before copying #6117
  • [enhancement][eager][refactor] Remove params from ParamGroup getitem #6096
  • [enhancement][feature][eager] Support tensor and optimizer serialization #6087
  • [enhancement][bug][eager] fix bug about tensor str in nonsymmetric cast and getitem in consist… #6239
  • [enhancement][eager] Cpu all reduce #5849
  • [feature][eager] Support assign copy interface #6228
  • [enhancement][eager][api][interface] Dev reconstruct pad ops #6223
  • [enhancement][eager][api][interface] support flow.cuda.is_available #6124
  • [bug][eager] make flow._C.local_all_reduce sync launched #6175
  • [enhancement][eager] Rename flow to oneflow in user hint #6190
  • [bug][eager][tooling][test][api][interface] Autotest generate input tensor #6206
  • [enhancement][eager] consistent tensor zeros_() #6202
  • [enhancement][eager] Cpu mpi #5865

Build enhancements:

  • [bug][build] Fix GRPC compilation failure on CMake 3.20 #5255
  • [bug][build] Refine header file copy #5254
  • [bug][build] Fix older version CMake doesn't support multiple targets in CLI #5248
  • [bug][build] Turn off NCCL_STATIC/CUDNN_STATIC when CUDA_STATIC is OFF #5243
  • [feature][build] Fix support for Ninja and add Ninja build in Simple CI #5236
  • [enhancement][build] Add cmake option CUDA_STATIC #5164
  • [bug][build] Fix protobuf debug postfix #5233
  • [enhancement][ci][build] Move default third party dir into build dir #5230
  • [enhancement][build] Refine protobuf cmake #5216
  • [enhancement][ci][build] Remove transport test main #5215
  • [enhancement][ci][build] Speedup opencv build #5213
  • [enhancement][build] Support clang #5015
  • [enhancement][documentation][build] Add prefix when creating git archive #5201
  • [enhancement][build] Add cmake option NCCL_STATIC #5160
  • [enhancement][build] Refine CMake CUDA version handling #5192
  • [enhancement][build] Use clang plugin to check Maybe variables are used #5358
  • [enhancement][build] Add BUILD_BYPRODUCTS for ExternalProject_Add #5316
  • [enhancement][build] Add cmake init cache to simplify user onboarding #5311
  • [feature][bug][build] Fix macOS support and run macOS build in Simple CI #4947
  • [enhancement][build] flatbuffers use mirror #5295
  • [enhancement][build] Don't build test by default #5302
  • [enhancement][build] Prevent building from scratch when toggle flag BUILD_GIT_VERSION #5259
  • [enhancement][build] Refine gRPC, glog, gflags cmake for conda #5276
  • [feature][build] Support XLA with CPU-only #5260
  • [enhancement][ci][onnx][build] Remove ONNX from CI #5257
  • [enhancement][build] Refactor build_wheel to support oneflowinc images #5427
  • [enhancement][build] Add arg skip_audit in build wheel #5423
  • [bug][build] hwloc disable shared #5388
  • [documentation][build] Update readme for autoconf and libtool #5376
  • [enhancement][build] remove dir python and compatible_single_client_python #5609
  • [bug][build][system] Fix pyyaml version #5594
  • [enhancement][ci][build] force release flags #5574
  • [bug][build] prevent endless loop #5534
  • [enhancement][build] Support sccache #5528
  • [enhancement][build] Add definition for CMAKE_BUILD_TYPE and print cmake_build_type in oneflow doctor #5505
  • [enhancement][ci][build][need-simple-ci] Fix macOS for recent changes #5705
  • [bug][build] fix return type error on gcc 4.8.5 #5660
  • [enhancement][build] Check CMAKE_BUILD_TYPE #5656
  • [enhancement][build] add -Werror=return-type #5655
  • [enhancement][build] Clean and fix for new py dir #5618
  • [enhancement][build] cmake: disable array-bounds check & treat warnings as errors for pyextobj and oneflow_internal & fix warnings #5838
  • [bug][build] set CMAKE_BUILD_TYPE to Release if undefined #5842
  • [enhancement][build][need-simple-ci] Fix all warnings & Add option TREAT_WARNINGS_AS_ERRORS to cmake #5751
  • [enhancement][build] add CMAKE_INTERPROCEDURAL_OPTIMIZATION in fast cmake cache #5970
  • [enhancement][build] add clang tidy target #5957
  • [bug][build] cmake: fix cmake cache args in opencv #5959
  • [enhancement][build] Add cmake option USE_SYSTEM_NCCL #5897
  • [enhancement][build] cmake: include third party headers as system headers to avoid warnings #5879
  • [enhancement][build] Ignore opencv-python on machine aarch64 #5884
  • [enhancement][build] enable CMake first class cuda support #5858
  • [bug][build] Fix compile warning (strict-aliasing) #5872
  • [enhancement][bug][build][need-simple-ci] Upgrade gtest and fix some errors raised by clang #6079
  • [bug][ci][build] cmake: fix ninja build in CI #6072
  • [bug][build] fix files not actually removed when building for multiple python versions #6060
  • [bug][build][api] functional_api: fix build error in mac os #6010
  • [bug][build][need-simple-ci][need-single-client-tests] Fix recompile from scratch #6036
  • [bug][build] Turn on NVCC's warnings #6011
  • [bug][build][need-single-client-tests] fix bundle .so of other python version #6034
  • [bug][ci][build][need-single-client-tests] use copy_all_files_in_dir to replace copy_files #6033
  • [enhancement][build] check compiler version in cmake #6026
  • [enhancement][build] Add CUDA_NVCC_THREADS_NUMBER #6017
  • [enhancement][build][need-simple-ci] optimize of_include_copy #5978
  • [enhancement][ci][build][need-single-client-tests] CI: remove -DTREAT_WARNINGS_AS_ERRORS=OFF #6008
  • [enhancement][build][xla] xrt: fix all warnings #5915
  • [enhancement][build] Prevent opencv compile failure with std 17 #5997
  • [enhancement][build] Use bundled cub #5998
  • [enhancement][ci][build] update clang tidy diff warnings-as-errors option #5989
  • [enhancement][build] Update run_clang_tidy.py to set return code and add warning-as-errors #5977
  • [enhancement][build] check: fix clang-tidy-diff commands #5972
  • [bug][build] Suppress NVCC warning #177-D #6094

XLA enhancements:

  • [bug][xla] Make the blob header memory aligned. #5286

System:

  • [enhancement][system] Refactor Memory Zone #5072
  • [enhancement][system] Add interface InferContext::OutputTensorDesc #5219
  • [bug][system] Lazy construct functor to make sure that the operators have already been registered. #5225
  • [enhancement][system] Refactor infer ctx output isdynamic #5220
  • [enhancement][system] Refactor infer ctx input isdynamic #5211
  • [enhancement][system] Wake up the heartbeat thread immediately #5081
  • [enhancement][system] Fix xla test case fail #5203
  • [enhancement][system] Add interface InferContext::InputDType #5153
  • [purge][system] delete const_cast in Output #5196
  • [feature][system] Add hwloc for topology detection #5291
  • [enhancement][system] fix registry may segment #5336
  • [enhancement][system] Use functional api instead of op_expr_helper::XXXOp. #5364
  • [enhancement][system] move btob to op #5274
  • [documentation][system] Add Latest News section in README #5361
  • [enhancement][bug][system] fix dropout module: return directly if not training #5346
  • [bug][system] add missing JUST #5357
  • [documentation][system] Add more communication outlets on README #5359
  • [enhancement][feature][system] CommNet dynamic register memory #5281
  • [enhancement][system] Use symbol device #5341
  • [enhancement][system] fix multithread bug in env #5283
  • [bug][system][api] fix bug in cfg_replacement #5335
  • [bug][system] Fix create log directory thread-unsafe #5326
  • [bug][system] fix_bug_in_make_parallel #5328
  • [enhancement][system][cfg] replace train_conf, job_conf using cfg::xx #5263
  • [enhancement][system][quantization] support tensorrt in qat #5287
  • [enhancement][system][api] Export functional apis for oneflow.experimental. #5313
  • [enhancement][system] fix bug check between cfg enum and proto enum #5285
  • [enhancement][system] replace CHECK_EQ using CHECK_EQ_OR_RETURN #5279
  • [enhancement][system] Refactor SbpXXX to cfg::SbpXXX #5120
  • [enhancement][system][api] add detach for LazyMirroredtensorImpl #5270
  • [enhancement][system] shorten XXIsDynamic4ArgNameAndIndex to be xxIsDynamic #5265
  • [enhancement][system][cfg] job_config to cfg #5235
  • [feature][system] Multi-Client LogicalRun degenerate to PhysicalRun #5479
  • [enhancement][system] fix ConstructOp without JUST #5480
  • [enhancement][system] Output arg modifier return maybe part 1 #5451
  • [feature][system][interface] Fea/nn graph/graph build ctx #5420
  • [enhancement][system] Throw exception if check failed #5457
  • [feature][system] multi client launch #5372
  • [enhancement][system][api] Optimize reduce mean #5452
  • [enhancement][system] export Tensor only to python #5440
  • [enhancement][system] Output arg modifier return maybe part_0 #5447
  • [enhancement][system] ThreadMgr support AddPlan #5450
  • [enhancement][system] Refactor infer ctx input tensordesc #5226
  • [enhancement][system][api] instruction builder return maybe #5442
  • [feature][system][interface] MultiClientSessionContext #5421
  • [enhancement][feature][system] add launcher, update multi client launch and exit #5414
  • [purge][system][refactor] Remove IOConf #5419
  • [enhancement][system] Dev refine generator #5426
  • [enhancement][system] Support inplace operations #5204
  • [enhancement][system][refactor] Dev refactor generator #5397
  • [enhancement][system] Add new placement init func #5408
  • [enhancement][system] NNGraphIf #5387
  • [enhancement][system][refactor] Cast explicitly in unpack call to avoid conflict with Optional. #5380
  • [enhancement][system][interface] [Random Generator] Part2: Migrate functional dropout #5378
  • [enhancement][system] replace ForeignJobInstance using JobInstance #5374
  • [enhancement][system][refactor] Speedup reshape module by 5x. #5381
  • [feature][system][interface] [Random Generator] Part1: Dev random generator #5360
  • [enhancement][system] Add ONEFLOW_STREAM_CUDA_EVENT_FLAG_BLOCKING_SYNC #5612
  • [enhancement][system] [part2]Remove singleclient outdated api #5568
  • [feature][system][interface] nn.Graph call and launch impl #5580
  • [enhancement][system] remove outdated doctest api and "@experimental_api" #5564
  • [feature][system][interface] Register ForeignCallback and Watcher in Multi-Client #5591
  • [enhancement][system] [Part-1]remove outdated api and files of multi-client on master branch #5556
  • [feature][system][interface] LazyInterpret build LocalTensor if input is local #5582
  • [enhancement][system] add job_pass MultiClientAutoSourceAndSinkTick #5507
  • [feature][system] Fea/nn graph/optimizer #5533
  • [feature][system][interface] New/CloseRuntimeBuffers and RunLazyJob impl #5571
  • [feature][system][refactor][interface] NNGraph interface and implement for CompileAndRuntime #5558
  • [feature][system] Fea/nn graph/forward graph #5516
  • [enhancement][system] Lazy job stream type #5389
  • [enhancement][system] Refactor single client autotick #5506
  • [enhancement][system] replace underline using dot in single client #5547
  • [bug][system] fix return type #5548
  • [feature][system][interface] LazyInterpret for UserOpExpr #5544
  • [enhancement][system] Add ProfilerStart/ProfilerStop API #5542
  • [feature][system][interface] LazyInterpreter for FetchOutputOpExpr and set op parallel_distribution #5527
  • [enhancement][system] Multi client push pull #5492
  • [enhancement][system] registry_callback_fn return maybe #5456
  • [enhancement][system] bw_gen_fn return maybe #5455
  • [enhancement][system] gen_bw_fn return maybe #5454
  • [enhancement][system] Compatible single client #5417
  • [feature][system][interface] GlobalMultiClientEnv and refine EagerExecution #5523
  • [enhancement][system] Job pass maybe system #5503
  • [enhancement][system] Remove Plan::net_topo #5502
  • [feature][system][interface] LazyInterpret for FeedVariableOpExpr #5490
  • [enhancement][system] Input arg modifier return maybe #5453
  • [feature][system][interface] Fea/nn graph/block scope #5498
  • [feature][system] jit_fuse_cast_scale #5332
  • [enhancement][system] Remove obsolete Profiler #5747
  • [enhancement][system][api] Dev fix batch norm not stats #5733
  • [enhancement][system] rename rpc_token to TransportToken #5735
  • [enhancement][system][api] Refactor maximum minimum py2cpp #5724
  • [enhancement][system] Replace piece_id with comm_net_sequence_number #5731
  • [enhancement][system] beautify stack frame #5686
  • [enhancement][system] Add env ONEFLOW_KERNEL_DISABLE_BLOB_ACCESS_CHECKER #5728
  • [enhancement][system] Add env ONEFLOW_THREAD_ENABLE_LOCAL_MESSAGE_QUEUE #5720
  • [enhancement][system][api][refactor] Refactor functional sub, mul and div apis #5713
  • [feature][system] ddp #5008
  • [enhancement][system][api][refactor] Refactor functional matmul and add apis. #5697
  • [bug][system] Fix ClearKV("plan") #5710
  • [enhancement][system] Rename cpu to async cpu #5712
  • [enhancement][system] Support tensor.to()/to_local() #5271
  • [feature][system][refactor][interface] Multi-Runtime for multi nn.Graph #5683
  • [bug][system][refactor] Add tag for Optional inplace constructor #5619
  • [enhancement][system] Move Global to env scope #5670
  • [enhancement][system] add JUST wrapper #5681
  • [enhancement][system] New sync consistent meta info #5634
  • [enhancement][system][refactor][interface] Refactor RuntimeCtx for multi-runtime #5664
  • [feature][system][interface] Feat: memory shared between EagerTensor with VariableRegst #5649
  • [enhancement][system] Use functional call directly instead of construct a module and then call-Add #5613
  • [enhancement][system] disable eager_op consistent mode #5647
  • [enhancement][system] add msg_penddin_list in ibverbs_qp to optimize qp_init_attr.cap.max_send_wr #5485
  • [enhancement][system] IBVerbsCommNet add knobs #5626
  • [enhancement][system] Prune python tensor #5596
  • [feature][system][interface] Feat: LazyInterpret infer op / tensor ParallelDescScope #5625
  • [enhancement][system] Replace src tick with wait and send ids #5603
  • [enhancement][system] Support symbol placement type in functional. #5627
  • [enhancement][system][api][refactor][interface] Dev advanced indexing #5559
  • [enhancement][system] Optimize maybe. #5839
  • [enhancement][system] Decorator 4 disable recursive boxing call #5796
  • [enhancement][system] add_eager_boxing_and_op_interpreter_dispatch_error_info #5819
  • [enhancement][system] Kernel CUDA Graphs Support #5725
  • [bug][system] Fix placement print bug #5853
  • [bug][system] when error msg formatting fails, return error->DebugString #5844
  • [enhancement][system][refactor] Rename variables named *parallel_distribution* to *nd_sbp* (1) #5815
  • [feature][system][interface] Support Free EagerTensor caught in nn.Graph build #5777
  • [enhancement][system] Reuse CUDA event / Refine BnInOp2Blob / Refine channel #5837
  • [enhancement][system][serving] fix bug in AddInputOutputOpsPass: check existence of key in HashMap(inferface_lbi2scope_sym_id) #5653
  • [enhancement][system][api] unpack_call: impl new unpack_call_dispatcher for better performance #5820
  • [feature][system] Feat consistent tensor python constructor #5812
  • [feature][system] Support 0shape tensor #5620
  • [documentation][system] fix launcher description #5770
  • [feature][system][interface] Multi-nn.Graph memory reuse by Chunk manager #5658
  • [bug][system] Fix naive b2p error #5806
  • [enhancement][system] set created generator with default rng seed #5801
  • [enhancement][system] enhance_local_to_consistent #5761
  • [feature][system] add flow.randn #5736
  • [enhancement][system] Refactor hierarchical parallel cast autograd #5764
  • [enhancement][system] Collective boxing executor add_plan delete_plan #5495
  • [enhancement][system] Fix throw abort #5795
  • [enhancement][system] DECORATE #5794
  • [enhancement][system] Interface eager boxing #5682
  • [enhancement][system] extract_consistent_to_consistent_op_expr #5870
  • [enhancement][system] disable backward pass consistent tensor meta check. #5871
  • [enhancement][system] Add CudaStreamIndexGenerator::GenerateNamedStreamIndex #5940
  • [bug][system] Only query PCI bus id when CUDA version >= 11 #5937
  • [enhancement][system] maybe: add JUST_MSG and CHECK_JUST_MSG #5904
  • [bug][system] Fix bug scalar #5950
  • [enhancement][system] framework: fix rvalue reference warnings #5948
  • [purge][system] Remove CudaWorkType #5942
  • [enhancement][system] refactor_symbol #5941
  • [bug][system] consistent_tensor_infer_cache: fix memory leak #5938
  • [feature][system] support to print gpu #5936
  • [enhancement][system] Bugfix static check #5935
  • [bug][system] fix nccl_version log #5934
  • [bug][system] Fix bug of multi-GPU train nn.Graph extra mem cost in rank 0 #5930
  • [enhancement][system] Only gradient acc be scheduled in parallel. #5926
  • [enhancement][bug][system] fix_ddp_bug_on_8_process #5929
  • [enhancement][system] Fix bug error msg format #5866
  • [feature][system] print consistent tensor data #5902
  • [bug][system] Move parse env to the constructor #5922
  • [enhancement][system] Remove GlobalWorkStreamId/GlobalThrdId #5917
  • [bug][system] shared_or_scalar: fix alias warnings #5916
  • [purge][system] Remove CompActor #5919
  • [enhancement][system] Use symbol dtype #5641
  • [enhancement][feature][system] Control Graph / Session / Env's python c++ object destruction #5845
  • [enhancement][bug][system] Sync access and assign indexing tensor. #5907
  • [enhancement][system][api][refactor] Dev consistent arange #5883
  • [enhancement][system] Lazy interpreter for new ConsistentToConsistentOpExpr #5903
  • [bug][system] Fix BUG of LazyInterpret FreeEagerTensor memory shared with regst #5891
  • [bug][system] fix typo in raise RuntimeError #5890
  • [enhancement][system][refactor] Rename the ParallelDistribution class to NdSbp #5814
  • [feature][system] add flow.rand #5722
  • [feature][system] Lazy Interpret support infer default device cpu #5880
  • [enhancement][system] Tensor str #5783
  • [feature][system][interface] Lazy to_consistent #5774
  • [enhancement][system] wait vm empty before exiting #5860
  • [enhancement][system] Eager boxing n to 1 #5949
  • [enhancement][system] add kernel observer #6052
  • [enhancement][ci][system] Optimize ddp broadcast and add speed/memory test in ci #6044
  • [enhancement][system] add var to control only print warning once when blocked #6045
  • [enhancement][system][refactor] Rewrite pow and logical functional apis #6032
  • [enhancement][system] Token seq id #5964
  • [enhancement][documentation][system] Remove python function wrapper. #6012
  • [feature][system] Add timeout and loc for blocking calls #6007
  • [enhancement][system] Eager boxing 1 to n #5943
  • [enhancement][system] Boxing expr #6015
  • [enhancement][system] new_X_to_B #5987
  • [enhancement][system] Add unimplemented return information #5952
  • [enhancement][system] Revert "Faster decorator" #6006
  • [enhancement][system] Throw exception if using advanced indexing for tensor setitem #6001
  • [enhancement][system] Support eager boxing sm 2 sn #5869
  • [enhancement][system] Move framework/local_dep_object.* to the eager directory #5988
  • [enhancement][system] Fix builtin op arg tuple. #5464
  • [feature][system][refactor] Dev functional multiple signatures #5982
  • [enhancement][system] Faster decorator #5996
  • [enhancement][system] Placed nd sbp #5995
  • [feature][system] Support asymmetric input/output/variable tensors in nn.Graph #5983
  • [enhancement][system] LightActor #5868
  • [bug][system] Prevent running oneflow in forked subprocess #5976
  • [bug][system] common/error: fix build error in mac os #5971
  • [bug][system] fix_bug_test_tensor_str #5958
  • [enhancement][system] Refine StreamContext #6191
  • [enhancement][system] container_util: fix VectorAt, remove useless MutMapAt #6172
  • [enhancement][system] Typesafe KernelState #6198
  • [enhancement][system] Primitive based copy task node #6195
  • [feature][system][interface] Lazy support Scalar #6181
  • [enhancement][system] Disable implicit boxing when parallel num eq one #6188
  • [enhancement][system] Primitive #6183
  • [enhancement][system] Remove IDMgr::GetGpuPhyIdFromThrdId/IDMgr::GetDeviceTypeFromThrdId #6169
  • [enhancement][system] remove op_expr_helper inside gradient_funcs #6057
  • [feature][system][api] Add tensor yaml, support export tensor functional api. #6099
  • [feature][system] Plan memory log #6151
  • [feature][system] Add dtype bfloat16 #5304
  • [enhancement][system] StreamContext #6129
  • [bug][system] Fix wrong inplace acc grad #6146
  • [enhancement][system] UserKernel remove job_desc #6144
  • [enhancement][system][api] Fea/graph/add outputs buffer to enable pipeline #6126
  • [enhancement][system] not fuse request for nccl 2.10.3 #6136
  • [bug][system] NewUniqueId thread safe #6141
  • [enhancement][system] XRT remove job_desc #6139
  • [enhancement][system] SystemOpFillJobNamePass #6138
  • [enhancement][system] mv_boxing_folder_to_core #6140
  • [enhancement][system] Refactor boxing interpreter to boxing expr #6134
  • [enhancement][system] Eager boxing one to one #6048
  • [enhancement][system] Vm cpu efficiency #6110
  • [enhancement][system] Naive generic boxing #6116
  • [feature][system] send/recv #5992
  • [enhancement][system] disable_print_stack_in_tensor_numpy #6123
  • [feature][system] add all_reduce by to_consistent #5963
  • [enhancement][system] KernelContext #6084
  • [enhancement][bug][system] Fix sync nccl and async nccl deadlock #6071
  • [bug][system][refactor] Refactor to local #6098
  • [enhancement][system] Replace xor with hash combine (part 1) #6078
  • [enhancement][system] Optimize error message #6073
  • [enhancement][system] Rename Error::xx to Error::xxError #6049
  • [enhancement][system] send formatted msg to glog #5999
  • [feature][bottleneck][bug][system][interface] [Feat.] NNGraph new eager tensor for new variable created in JobPass #6091
  • [bug][system] Fix bug of multi-GPU eager copy D2H extra mem cost in rank 0 #6092
  • [enhancement][system][api] Rename module flow.F to flow._C #6053
  • [feature][system][interface] [Feat.] Eager consistent OFRecordReader #6089
  • [enhancement][system][api] Dev fix and align interface #6075
  • [feature][bottleneck][bug][system][interface] NNGraph input/output valid by register tensors #6240
  • [bug][system][interface] Fix bug of Multi-Client src tick output order #6221
  • [enhancement][bug][system] Add cast primitive #6234
  • [feature][bottleneck][system][interface] Auto FixPipelineStageIdPass #6204
  • [enhancement][system] move scalar to oneflow namespace. #6235
  • [enhancement][system] UserKernel init CUDA Graphs with state #6230
  • [feature][system] Comm broadcast #6213
  • [enhancement][system][refactor] Rename opname to optype_name in AutogradEngine #6154
  • [enhancement][system] Add memset primitive #6218
  • [enhancement][system] Add StreamContext::device_type()/DeviceCtx::device_type() #6217
  • [feature][system] add all_gather and fix bug of multi rank doctest #6189
  • [feature][system][interface] [Feat.] Lazy interpreter skip hierarchical_parallel_cast #6208
  • [purge][system] Cleanup KernelUtil #6212
  • [enhancement][system] StreamContextAdapter #6205
  • [enhancement][system] Dev eliminate gcc warnings #6199
  • [feature][bottleneck][system][interface] [Feat.] nn.Graph support grad acc with input/output tensor #6155
  • [enhancement][system] Cpu symmetric s to s #6153
  • [enhancement][system][upload-core] Op expr infer tensor meta #5064
  • [enhancement][system] Infer consistent tensor meta #5362

CI enhancements:

  • [bug][ci][api][interface] Refine module test #5232
  • [enhancement][ci] Add Simple CI, runs CPU-only on GitHub hosted servers #5207
  • [enhancement][ci] Run exe test in CPU-only #5202
  • [enhancement][ci] Cancel all workflow runs but the latest #5206
  • [enhancement][ci] Fix master not running Simple CI #5368
  • [enhancement][ci] Refine Simple CI and Clang analysis #5367
  • [enhancement][feature][bug][ci][documentation][interface] Fix upsample bilinear bug #5363
  • [enhancement][ci] Build nightly for py39 #5318
  • [enhancement][ci] Try distributed run for 3 times to prevent failure #5305
  • [enhancement][ci] Upload Simple CI logs to cloud #5268
  • [enhancement][ci] Remove cpu_op_eager and cuda_op_eager #5470
  • [bug][ci] fix segfault in clang plugin #5437
  • [enhancement][ci] Refine Simple CI error output #5435
  • [enhancement][ci] Add conda env to Simple CI #5385
  • [enhancement][ci] Fix clang plugin core file not found #5390
  • [bug][ci] upload core when build with clang plugin #5384
  • [bug][ci] clang plugin skip more files #5373
  • [enhancement][ci] Use gh-action-scheduler-v2 #5370
  • [enhancement][ci] relax speed threshold #5569
  • [bug][ci] Fix wrong test path under compatible #5567
  • [enhancement][ci][need-simple-ci] Prevent upload logs automatically #5560
  • [enhancement][ci][interface] Add nn.AdaptiveAvgPool1d and nn.AdaptiveAvgPool3d #5445
  • [feature][ci] add speed test in ci #5496
  • [enhancement][ci] Reduce usage of Simple CI #5546
  • [feature][bug][ci][api] Restruct upsample module #5524
  • [feature][ci] multi client launcher test #5488
  • [enhancement][ci] Remove automerge if cuda_new_interface failed #5519
  • [enhancement][ci] Prevent adding subdir in python/test #5514
  • [enhancement][ci] piprepo->pipindex #5517
  • [enhancement][ci] add dynamic_loss_scale in ci tests #5337
  • [enhancement][ci] Add timeout for wait_gpu_slot #5497
  • [enhancement][feature][ci] new static check based on clang-tidy #5476
  • [enhancement][ci] Fix url not downloadable in some browsers #5701
  • [feature][ci] multi client multi machine test #5685
  • [enhancement][ci] Add cpu new interface CI #5639
  • [enhancement][ci][need-simple-ci] Mv clangtidy to simple ci #5667
  • [enhancement][ci][need-simple-ci] use clang tidy appimage in ci #5841
  • [enhancement][ci] Use gcc 7 in release to prevent error #5840
  • [enhancement][ci] bn tol 1e-4 => 1e-3 #5811
  • [enhancement][ci] fix distributed run on built dir #5810
  • [enhancement][ci] fix third party mirror check_sum #5802
  • [ci][documentation] find more accurately which files need to be doctested #5782
  • [enhancement][ci] Print stack unconditionally #5779
  • [enhancement][ci][need-simple-ci] Enable more checkers for clang-tidy in CI #5738
  • [enhancement][ci] CI: add clang-tidy check to test.yaml #5920
  • [ci][documentation] fix docstring in oneflow.nn.functional namespace #5807
  • [enhancement][ci] disable TREAT_WARNINGS_AS_ERRORS in Release CI #5886
  • [enhancement][ci] Skip ci jobs by git diff #5863
  • [bug][ci] quick fix #5978 #6030
  • [enhancement][bug][ci] fix clang tidy diff options and file format #5990
  • [enhancement][ci] add flow.relu #5847
  • [enhancement][ci] equal => allclose #6164
  • [bug][ci][need-simple-ci] CI: fix clang tidy checks in simple ci #6161
  • [enhancement][bug][ci][documentation][api] add interpolate and layer_norm docs #6157
  • [bug][ci] update speed test #6113
  • [enhancement][bug][ci][documentation][api] speed up import oneflow #6107
  • [bug][ci] Also try install dev deps for CODEGEN_PYTHON_EXECUTABLE #6115
  • [bug][ci][need-simple-ci] set gtest_CMAKE_DEBUG_POSTFIX "d" #6085
  • [enhancement][ci] add cache init file for clang and CI build with clang #6062
  • [enhancement][ci] add emoji in speed test output, make it continue-on-error #6214

Test enhancements:

  • [bug][test][interface] Fix acos ci bug #5217
  • [feature][test] implement automated test #5321
  • [enhancement][test] move generator test into ops folder to accelerate tests #5472
  • [feature][test][api] Add autotest part2 #5467
  • [enhancement][test][api][interface] Add some tests with the new framework for auto testing #5561
  • [bug][test] fix test error when do multi case test on graph #5590
  • [enhancement][test] Refine module test using auto test by yaochi #5484
  • [enhancement][test] Add autotest for BatchNorm2d #5734
  • [enhancement][test] RTH_update_op_test #5823
  • [enhancement][test] dev adamw graph config #5745
  • [feature][test][api][interface] Add new autotest #5562
  • [bug][test] restore test of alexnet graph #5798
  • [enhancement][test][interface] add zhangshen op-test #5600
  • [feature][bug][tooling][test][interface] Record autotest wrong code #5923
  • [enhancement][feature][test][api] add randint #5718
  • [bug][test] fix multi machine test #5984
  • [enhancement][test][interface] some op test #6095

Tooling enhancements:

  • [bug][tooling] user/summary: fix memory leak in FillImageInSummary #5742
  • [enhancement][tooling][cfg] cfg: add move assignment operator for performance #5962
  • [enhancement][tooling][api][refactor] refactor_all_device_placement_api #6080
oneflow - v0.3.0

Published by jackalcooper about 3 years ago

oneflow - v0.5.0b1

Published by jackalcooper about 3 years ago

Changelog

v0.5.0b1 (13/09/2021)

Highlights

  • First-class support for eager execution. The deprecated APIs have been moved to oneflow.compatible.single_client
  • Drop-in replacement of import torch for existing PyTorch projects. You can test it by interchanging the imports: import oneflow as torch and import torch as flow (see the snippet after this list).
  • nn.Module for eager execution
  • nn.Graph for lazy execution
  • DDP for data-parallel training
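
As a minimal sketch of the drop-in replacement above (illustrative only, not from the release note itself; it assumes the script touches just the torch-compatible subset of the API):

import oneflow as torch  # stand-in for `import torch`

x = torch.ones(2, 3)
w = torch.nn.Parameter(torch.randn(3, 4))
y = torch.matmul(x, w)  # runs on OneFlow, with torch-style spelling
print(y.shape)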

A sneak peek of the new API

Here is a minimal example showcasing how to incorporate an nn.Module into an nn.Graph and run it in lazy mode.

import oneflow as flow

class NeuralGraph(flow.nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model  # model is a nn.Module instance

    def build(self, x):
        y_pred = self.model(x)  # describes the forward computation to be traced
        return y_pred

model = flow.nn.Linear(3, 4)  # any nn.Module instance works here
graph = NeuralGraph(model)    # create a nn.Graph instance
x = flow.randn(2, 3)
y_pred = graph(x)             # compile (on first call) and run the nn.Graph
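
Note that build is not called directly: the first graph(x) call traces build and compiles a static execution plan, and later calls reuse that plan, so the Python code inside build only runs during tracing.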

New in Python API

  • [feature][eager][op][test][python][interface] Add test for convtranspose2d #5239
  • [enhancement][python][interface] Add GroupNorm #5175
  • [enhancement][eager][python][interface] [Add] avgpool1d avgpool3d #5165
  • [feature][eager][op][python][interface] Add deconv cpu impl #5224
  • [bug][eager][api][python][interface] Fix acosh bug #5221
  • [feature][eager][op][python][interface] Dev modules ctc loss #5168
  • [bottleneck][bug][documentation][python][interface] Fix meshgrid test bug #5208
  • [eager][documentation][python][interface] Rename CosineScheduler to CosineAnnealingLR #5112
  • [feature][eager][python][interface] Add meshgrid module #5205
  • [enhancement][feature][bug][op][python] support bias in conv2d's parameter list #5322
  • [eager][documentation][api][python][interface] add not_equal, greater_equal and less_equal module #5350
  • [enhancement][eager][python] refine pow module and its test #5319
  • [enhancement][eager][op][python] Add triu op #5329
  • [enhancement][bug][python] Fix optimizer for not supporting all kinds of iterables #5355
  • [bug][python][interface] raise IndexError in get_canonical_index to support for loop #5345
  • [bug][python][interface] tensor slice assign supports broadcasting #5344
  • [enhancement][op][python] add cpu group conv logic #5314
  • [enhancement][python] Add 'nn.Mish' module and corresponding functions #5310
  • [enhancement][build][python] Remove ONNX from setup py #5297
  • [enhancement][python][interface] [add] zeropad2d #5278
  • [feature][system][python][interface] Lazy nn.Graph FeedInputOpExpr #5458
  • [feature][python][interface] integrate nn.image.flip #5411
  • [bug][python] Fix issues in point of MultiClientSession #5469
  • [enhancement][bug][python] update HasAllMultiClientEnvVars() #5459
  • [enhancement][python] Add in_top_k function #5428
  • [enhancement][python] Dev add docstring #5449
  • [feature][api][python] MultiClientSession #5407
  • [documentation][python] remove --user #5431
  • [feature][python][interface] nn.Graph python #5309
  • [feature][python][interface] Fea/nn graph/graph name #5413
  • [bug][python][interface] rm nn.Graph.train #5424
  • [op][documentation][api][python][interface] add bernoulli module #5353
  • [enhancement][python] flow.S/B/P #5306
  • [enhancement][documentation][python] Add instruction on upgrade pip #5400
  • [enhancement][python] Rm oneflow export and experimental #5589
  • [bug][python] Fix nn.graph.utils module conflict #5598
  • [feature][ci][python] Update autotest framework #5520
  • [enhancement][python] copy of_proto_python_dir to compatible_single_client_python #5539
  • [enhancement][api][python] del default env init #5537
  • [enhancement][python] Fix single client using same glog file #5535
  • [bug][api][python] Fix Session TryClose #5531
  • [enhancement][feature][python] split vector-matrix norm #5478
  • [feature][eager][op][python][interface] Add more upsample kernel #5382
  • [enhancement][feature][test][python] add torchstyle unittest #5489
  • [feature][system][python] nn.Graph with training #5662
  • [enhancement][feature][python] Fea/nn graph/block proxy func #5727
  • [enhancement][api][python] consistent_tensor_to_api #5703
  • [feature][eager][op][python] Dev Align torch avgpool #5610
  • [enhancement][python] fix circular deps of sbp python module #5706
  • [documentation][python] [part5]Remove singleclient outdated api #5674
  • [enhancement][python] [part4]Remove singleclient outdated api #5672
  • [bug][op][python] remove outdated code in conv3d #5696
  • [enhancement][test][python] enlarge tolerance of dataloader test #5689
  • [enhancement][test][python] add autotest for some math ops #5646
  • [feature][python] nn.Graph optimizer part 2: add L2, pass job complete, refactor #5604
  • [enhancement][python] Add clip_grad_norm #5299
  • [purge][python] Remove Single-Client API in oneflow default python #5827
  • [bug][python] Fix ddp grad size #5834
  • [enhancement][feature][python] Dev RMSprop graph conf #5768
  • [enhancement][purge][eager][python] remove scale arg in optimizer #5821
  • [enhancement][feature][python] graph/block io check #5803
  • [enhancement][feature][python] Dev adam graph conf #5709
  • [purge][python] [part10]Remove singleclient outdated api #5756
  • [feature][api][python] better repr of nn.Graph for debug #5762
  • [bug][python] fix weight decay in RMSprop #5755
  • [purge][python] [part9]Remove singleclient outdated api #5752
  • [purge][python] [part8]Remove singleclient outdated api #5750
  • [documentation][python] add first batch of methods in oneflow.nn.functional namespace #5693
  • [purge][python] [part6]Remove singleclient outdated api #5704
  • [bug][python] use default_generator.seed() as random_seed in init #5721
  • [bug][system][python] ddp broadcast params and buffers #5913
  • [enhancement][test][python] Add consistent tensor requires grad test #5925
  • [bug][python] wrap flow.nn.init.* with flow.no_grad() #5932
  • [feature][api][python][interface] add clip_grad to optimizer #5817
  • [enhancement][ci][op][test][python] add randperm with test and docs #5680
  • [feature][api][python] Fea/nn graph/ lr_schedule(and cosine lr_sch) and opt_group #5846
  • [bug][python] fix bug of SyncOnMasterFn atexit #5909
  • [purge][python] Delete single client nn modules #6061
  • [enhancement][python] Move framework.distribute to env #6022
  • [bug][python] skip sync when abnormally exiting #6025
  • [feature][python] Fea/nn graph/warmup amp config #5969
  • [documentation][python] add optimizer api docs #6131
  • [documentation][python] add_tensor_api_doc #6127
  • [bug][python] Fix test_grid_sample.py and test_affine_grid.py threshold #6125
  • [documentation][api][python] add doc of graph #6093
  • [bug][python] Fix make of_format fail in ubuntu #6120
  • [feature][api][python][interface] Fea/graph helpers #6088
  • [enhancement][eager][python][interface] Use flow.randint in dataloader #6086
  • [feature][eager][api][python][interface] Import oneflow as torch #6076
  • [enhancement][test][api][python][refactor] rename OfrecordReader to OFRecordReader #6090
  • [purge][python][need-single-client-tests] Delete single client nn modules #6082
  • [enhancement][python] flow.load tolerates FileNotFound fault #6083
  • [feature][python] Fea/pipeline in graph #6105
  • [enhancement][test][python] graph activation checkpointing #6192
  • [enhancement][feature][op][python] rnn test #6165

New in Ops:

  • [enhancement][op][api][refactor] [Functional] Part2: Add partial unary and math functional apis #5218
  • [enhancement][bug][op][interface] Refine deconv kernel #5229
  • [enhancement][op][api][interface] add ReflectionPad2d #5172
  • [feature][eager][op][api][interface] crossentropyloss and nllloss support ignore_index #5195
  • [feature][eager][op][api][interface] Yejiaojiao/dev bcewithlogitsloss #5173
  • [bug][ci][op] Dev user op set default is_dynamic #5223
  • [enhancement][op] add magic method for pow #5199
  • [enhancement][op][interface] add cpu version of upsampling #5194
  • [enhancement][bug][op][api][interface] add ReplicationPad2d #5148
  • [feature][eager][op][api][interface] add kldivloss module #5155
  • [feature][eager][op][documentation][build][api][interface] Add floor module and the corresponding testcases #4964
  • [enhancement][feature][op] Dev conv1d module #5280
  • [enhancement][op] Add ctc_greedy_decoder op #5294
  • [enhancement][op][system] Dev remove default grad func #5320
  • [enhancement][op][system] Add pad grad func. #5354
  • [enhancement][op][system] Add gradient funcs. #5348
  • [feature][purge][bug][eager][op][interface] fix upsample nearest bug #5347
  • [enhancement][op][system] [Functional] Part7: Migrate pooling ops #5253
  • [enhancement][op] nvjpeg hardware acc #5240
  • [enhancement][feature][ci][eager][op][api][interface] Add bmm module #5334
  • [enhancement][eager][op] Dev image decode eager #5333
  • [enhancement][op] Optimize softmax warp impl #4977
  • [enhancement][eager][op] Dev tensor buffer eager #5317
  • [enhancement][op][api][refactor] [Functional] Part6: Migrate conv op #5252
  • [enhancement][eager][op] Dev sort eager #5284
  • [enhancement][bug][op][api] fix bceloss bug in default weight and reduction #5303
  • [bug][eager][op] remove redundant assert and check #5264
  • [enhancement][bug][ci][op] fix bceloss bug about weight #5269
  • [enhancement][op][api][refactor] [Functional] Part5: Migrate nn ops #5249
  • [enhancement][eager][op] Dev argsort eager #5273
  • [enhancement][op][api][refactor] [Functional] Part4: Migrate array ops #5247
  • [enhancement][op][api][refactor] [Functional] Part3: Migrate binary and activation ops #5246
  • [bug][ci][op][test] Dev fix rmsprop ci fail #5481
  • [enhancement][op] add inplace method: Tensor.sin_ #5471
  • [bug][op] hotfix image_batch_align #5461
  • [enhancement][eager][op][interface] Dev maxpool series op 123d #5244
  • [bug][op] fix pool gpu kernel #5446
  • [feature][eager][op][api][interface] add pixelshufflev2 module #5383
  • [enhancement][feature][ci][eager][op][documentation][api][interface] Add flow xxx and tensor xxx autotest #5386
  • [enhancement][feature][eager][op][api][interface] Modules chunk #5324
  • [enhancement][eager][op] add image normalize for eager #5402
  • [enhancement][eager][op] Dev batch align module #5401
  • [enhancement][eager][op] add coco reader module #5391
  • [enhancement][wip][op] Restruct Elementwise kernel #4130
  • [bug][op] Fix DecodeRandom reuse mem #5606
  • [enhancement][op] Align pytorch maxpool #5525
  • [enhancement][bottleneck][eager][op][api] implementation of constantpad-3d op #5529
  • [enhancement][eager][op] Add scale size for resize #5509
  • [enhancement][op][api][refactor] Dev optimize tensor setitem #5501
  • [enhancement][op] register uint8 dtype to support dataloader #5499
  • [enhancement][op] Add unique.cuh #5487
  • [enhancement][op][api][interface] Dev ofrecord auto truncating #5412
  • [feature][op][system][interface] Feat: LazyInterpret::ApplyImpl support SourceUserOpExpr and Copy #5711
  • [enhancement][op][interface] Dev logical_and/or modules #5636
  • [enhancement][op] support any number positional arguments for ones and zeros op #5698
  • [enhancement][feature][eager][op] Add conv3d Module #5327
  • [feature][eager][op][api][interface] add batchnorm3d module #5631
  • [bug][eager][op] fix reduce min max backward bug #5651
  • [enhancement][op] Debug dim scatter #5371
  • [enhancement][op][interface] Dev eye #5583
  • [enhancement][eager][op] Dev minimum maximum #5576
  • [enhancement][op] Restruct activation grad op #5669
  • [enhancement][feature][eager][op] Rewrite activation function #5465
  • [bug][op][documentation] add oneflow.cat for documentation #5621
  • [enhancement][op] Lcy logsoftmax #5746
  • [feature][op][need-simple-ci] Feat empty op #5659
  • [enhancement][eager][op] Dev split #5714
  • [enhancement][op][interface] add index_select op #5661
  • [bug][op] fix nvjpeg hw acc #5851
  • [enhancement][op] Remove move in conv_cudnn #5828
  • [enhancement][op][interface] Dev logical_xor module #5694
  • [bug][eager][op] fix squeeze #5808
  • [enhancement][op] Get parallel_id and parallel_num through rank and world size in DDP #5717
  • [bug][eager][op] delete interpolate int type #5805
  • [bug][op] Fix bug in scatter #5743
  • [enhancement][op] Refactor: remove module not required, call function directly #5754
  • [enhancement][op] Remove modules not required(tan, erfc, log1p, scatter_nd) #5791
  • [enhancement][op] Refactor scatter, clamp and pow in cpp instead of in python #5715
  • [enhancement][op] Rm useless code in gather files #5687
  • [enhancement][eager][op] change flip_code to scalar #5786
  • [enhancement][bug][op][interface] fix upsample bug #5753
  • [bug][op][interface] Quick fix Lazy nn.Graph input/output OpConf.BlobConf.is_dynamic #5767
  • [enhancement][bug][eager][op] fix argwhere 0-dim bug #5760
  • [enhancement][eager][op] delete unused code #5744
  • [feature][op] Export fused_scale_tril op #5933
  • [bug][op] Fix backward bug in 3d #5908
  • [bug][op] Fix one_hot api limit #5927
  • [enhancement][eager][op] Dev where scalar #5797
  • [bug][op] fix grad error #5914
  • [feature][bug][op] Fix inplace op circle reference bug #5910
  • [enhancement][op] Move the judgment content to c++, And add scalar fmod #5854
  • [enhancement][op] Support combined_margin_loss op in flow.nn.modules #5830
  • [enhancement][op][api][interface] functional_one_hot #5315
  • [enhancement][op] Dev scalar op #5778
  • [bug][eager][op] fix gather kernel 0 shape #5888
  • [enhancement][op] add l2_normalize for mutl-client interfaces #5859
  • [feature][op] Export function softmax_cross_entropy #6056
  • [enhancement][op] Add int attr for functional adaptive average pool #6059
  • [enhancement][op][interface] dev full op #5955
  • [bug][eager][op] fix 0dim inplace add #6029
  • [feature][op][system][interface] Feat: nn.Graph image gpu decoder #6014
  • [enhancement][op][interface] dev optim_optim_lr_scheduler_multisteplr #5975
  • [enhancement][op] NopKernel #6035
  • [enhancement][eager][op][api] Dev tril op #6005
  • [enhancement][op] dev unfold and fold #5675
  • [enhancement][op] ResNet CUDA Graphs #6018
  • [enhancement][feature][op] add broadcast pow #6013
  • [enhancement][op][interface] init of op diag #5298
  • [op][documentation][api] Fix api document bug #6009
  • [enhancement][op] Dev fused functional #5954
  • [bug][op][build] Add nvcc flag -Werror cross-execution-space-call #6002
  • [bug][op] Fix Normalization grad function #5993
  • [enhancement][feature][eager][op][test][interface] Add fused self attention #5966
  • [enhancement][bug][ci][eager][op][api][interface] Try to fix var bug #5973
  • [enhancement][feature][eager][op][interface] add prod op #5867
  • [enhancement][eager][op][api] add glu op #6065
  • [enhancement][op] Align Torch.nn.functional poolXd #6184
  • [bug][eager][op] fix backward index for gamma beta #6149
  • [bug][op][system] Fix BroadcastMatmulGrad bug #6168
  • [enhancement][op][api] Add Int support for functional.avg/maxpool #6174
  • [bug][eager][op][api][interface] align dropout api name with pytorch #6170
  • [enhancement][op] support inplace operation for hardsigmoid #6137
  • [enhancement][bug][op] Fix do bias correction in Adam/AdamW #5960
  • [bug][eager][op][api][interface] fix repeat 0-dim tensor bug #6150
  • [enhancement][bug][op] Fix select_first_grad bug #6142
  • [bug][ci][eager][op][documentation][interface] Add clipgrad doc and contiguous #6130
  • [bug][op] Fix eager optim dynamic attr bug #6111
  • [enhancement][op] Support grid_sample and affine_grid operator #6038
  • [op][documentation] Export apis for documentation #6068
  • [enhancement][feature][bug][ci][eager][op][documentation][interface] transfer python function to c++ method #6114
  • [op][documentation] Dev functional batch_gather #6233
  • [enhancement][op][test] fix cross_entropy_loss and its test #5799
  • [bug][op] Use attr nd_sbp to check consistent #6222
  • [enhancement][op] Dev fused bn functional #6077
  • [enhancement][op] support default value in intlist #6201
  • [bug][op] fix sparse_softmax get_nd_sbp #6203
  • [bug][op] Fix bug in model fused update #6197
  • [enhancement][op][system][refactor] Optimize tensor getitem. #5433

New in Eager:

  • [enhancement][eager][interface] Reconstruct module files #5251
  • [bug][eager][documentation][interface] Fix conv module bug #5245
  • [bug][ci][eager][interface] Fix bce withlogitloss ci error #5237
  • [feature][eager][api][interface] module BCELoss #5144
  • [enhancement][feature][eager][api][interface] Dev norm op #5178
  • [enhancement][bug][eager] Fix stack module #5222
  • [enhancement][feature][eager] Support different dtype of equal module #5214
  • [enhancement][bug][eager][documentation][api][interface] Add nllloss backward #5210
  • [enhancement][eager][api][upload-core] Decouple FileSystem and IOConf #5162
  • [enhancement][ci][eager] Set lower precision avoid ci failing #5200
  • [eager][documentation] Add hint when apply FunctionNode second time #5369
  • [enhancement][feature][bug][ci][eager][documentation][api] Fix upsample bilinear bug #5366
  • [bug][eager] Fix not contiguous ndarray to tensor bug #5351
  • [enhancement][eager][system] Infer consistent tensor meta #5118
  • [feature][eager] Feat graph autograd engine #5296
  • [enhancement][eager][interface] Dev type as module #5349
  • [feature][eager][documentation][api][interface] Add new ones module #5342
  • [enhancement][bug][eager] Fix logical slice assign dtype #5339
  • [bug][ci][eager][documentation][api][interface] Fix where module bug #5300
  • [bug][ci][eager][documentation][api] Fix l1loss ci error #5307
  • [enhancement][bug][eager][documentation][api][interface] Qi's First Edit of deleting "print" and ".numpy" #5129
  • [feature][eager][refactor] Separate autograd meta to tensor #5267
  • [feature][eager][api][interface] add tile module #5234
  • [enhancement][eager] Release lambda function to reuse tensor memory #5266
  • [feature][bug][eager][documentation] Fix default value not set bug #5483
  • [enhancement][eager][interface] [Add] gather_nd scatter_nd #5422
  • [enhancement][bug][eager] fix param #5473
  • [bug][eager] Fix Tensor.grad setter bug #5462
  • [enhancement][eager] Rename now_grad_arg to current_grad #5466
  • [eager][test][documentation][interface] Add autotest part1 #5436
  • [enhancement][eager] Use functional copy instead of op_builder #5460
  • [bottleneck][bug][eager][interface] fix -1 index not support bug #5448
  • [bug][ci][eager][documentation][api] Fix concat backward bug #5443
  • [enhancement][bug][ci][eager] Add autograd engine warning #5444
  • [feature][eager][api][interface] Smoothl1loss #5256
  • [enhancement][bottleneck][eager] remove device dtype params #5434
  • [bug][ci][eager][documentation][interface] Delete maxpool failed test #5409
  • [enhancement][eager][api] Add tensor grad assignment #5379
  • [enhancement][bug][eager] fix-abs #5398
  • [enhancement][bug][eager][interface] Fix bn track running stats #5393
  • [enhancement][bug][eager] Support uint dtype of constant op #5396
  • [enhancement][bug][eager][documentation][interface] Delete useless code upsample #5392
  • [enhancement][ci][eager][interface] add flow.view #5301
  • [enhancement][bug][ci][eager][api][interface] Add masked select module #5356
  • [bug][eager][interface] Fix batchnorm backward bug #5602
  • [enhancement][eager] Support weight_decay (L2 actually) #5587
  • [feature][eager][documentation][api] Add new autotest #5588
  • [enhancement][eager][documentation][api] Dev fmod #5404
  • [feature][eager] Support inplace add #5432
  • [feature][eager][interface] Feat tensor stride property #5543
  • [enhancement][feature][eager][documentation][api] Add flip module #5541
  • [feature][eager] Feat module repr #5486
  • [enhancement][bottleneck][bug][eager][interface] Fix maxpool1d params #5493
  • [enhancement][feature][eager][interface] Dev flow.utils.data part1 #5406
  • [bug][eager][api] Fix tensor getitem bug #5474
  • [enhancement][eager][need-simple-ci] export datasets interface #5691
  • [enhancement][eager][system] rebase #5601
  • [enhancement][eager][test] added nn.RecordBytesDecoder with its test #5475
  • [enhancement][feature][eager][need-simple-ci] 0-dim tensor support #5552
  • [enhancement][bug][eager] rewrite slice_update backward #5677
  • [enhancement][bug][eager][interface] align view input style with torch #5676
  • [enhancement][eager][interface][need-simple-ci] add autotests for modules #5666
  • [enhancement][bottleneck][eager][interface] Dev constantpad1d op #5579
  • [enhancement][eager][api][interface] Restruct MathOps AutoTest #5654
  • [enhancement][bug][ci][eager] Fix flip bug #5657
  • [bug][eager][api][interface] Fix expand module bug #5650
  • [enhancement][bug][eager][documentation][api] Fix repeat bug #5633
  • [enhancement][eager][test][api][interface] Add new autotest #5617
  • [enhancement][eager][api][interface] Dev flow.utils.data part2 #5500
  • [enhancement][bug][eager] make setitem device match #5835
  • [bug][eager][api][interface] align reshape input param with pytorch #5804
  • [feature][bug][eager][api] Align where op with torch #5850
  • [enhancement][bug][eager][api] Restruct prelu op #5829
  • [bug][eager][need-simple-ci] fix pooling ceil_mode bug #5818
  • [enhancement][eager] stateful local kernel supports consistent #5789
  • [bug][eager][api][interface] Fix argwhere bug #5816
  • [enhancement][eager][documentation][api] dev-nonzero #5809
  • [enhancement][feature][eager][api] Add fake quantize op #5690
  • [enhancement][bug][eager][documentation][api] Add api #5663
  • [enhancement][eager] Refactor consistent infer result #5790
  • [bug][eager][need-simple-ci] skip dataloader test #5780
  • [bug][eager][need-simple-ci] fix 0-dim tensor.fill_ #5771
  • [enhancement][eager] Cpu mpi broadcast #5726
  • [feature][eager] Feat grad mode classes #5956
  • [enhancement][bug][eager] fix wrong names #5951
  • [enhancement][eager][system] Local dep object pool #5953
  • [enhancement][eager][interface] rename OpExprInterpState to AutoGradCaptureState #5918
  • [bug][eager] Fix linear bug #5945
  • [bug][eager] Fix tensor_meta update bug #5924
  • [enhancement][eager] use flow.randperm #5928
  • [enhancement][eager] consistent init/save/load #5896
  • [enhancement][bug][eager][documentation][interface] Restruct sort and argsort op #5911
  • [enhancement][bug][eager][interface] Try to fix the problem that insightface cannot converge. #5906
  • [enhancement][bug][eager][interface] Add autotest #5899
  • [enhancement][eager] The scheduler thread joins worker threads #5893
  • [enhancement][eager] Bugfix async callback #5881
  • [feature][eager] Feat tensor to bool #5836
  • [bug][eager] Remove inplace broadcast_add #5551
  • [enhancement][eager] Broadcast consistent shape and dtype #5784
  • [enhancement][eager] Fix optimizer list parameters input bug #5848
  • [enhancement][eager][interface] Dev flow.utils.data part3 #5644
  • [enhancement][eager][api] Normalize naming of modules #6066
  • [enhancement][feature][eager][api][interface] add truncnormal #6051
  • [enhancement][bug][eager] AutoMatedTest support test module.parameter.grad #6043
  • [enhancement][feature][bug][eager] add module call kwargs #6069
  • [enhancement][eager][api][interface] add tensor.item tensor.tolist #6021
  • [enhancement][eager][api][interface] Export pool ops api #6047
  • [enhancement][bug][eager][test][documentation][interface] Add more autotest sample #6039
  • [enhancement][bug][eager][system] disable cuda_h2d stream #6020
  • [feature][eager][test][api][interface] Add autotest codegen #6019
  • [feature][eager][documentation] Refactor cosine lr scheduler #6000
  • [enhancement][eager][interface] tensor.cpu/tensor.cuda #5894
  • [enhancement][eager][api] Support consistent_tensor.to(dtype) #5991
  • [bug][eager][interface] remove redundant codes in ModuleDict #5961
  • [bug][eager] Fix LayerNorm check bug #6196
  • [enhancement][eager][api] Change dropout api #6182
  • [enhancement][good for pr][eager][api][interface] add: test convert dependency #6023
  • [enhancement][bug][eager][interface] Fix autotest codegen bug #6171
  • [bug][eager] restore instr_local_dep_object_pool_size for nccl #6160
  • [enhancement][eager][api][interface] Align pooling op functional api names with torch #6163
  • [feature][bug][eager][api][interface] delete file #6162
  • [bug][eager] Fix optim load_state_dict bug #6152
  • [enhancement][eager][api] add is_training to dropout functor #6148
  • [enhancement][eager] Decompose nd sbp boxing #5800
  • [enhancement][eager] support consistent_tensor.to(copy=True) #6122
  • [feature][eager] Static grad scaler #6135
  • [bug][eager] Fix LayerNorm expr bug #6121
  • [bug][eager][api] move numpy c api init in numpy.cpp, make np array contiguous before copying #6117
  • [enhancement][eager][refactor] Remove params from ParamGroup getitem #6096
  • [enhancement][feature][eager] Support tensor and optimizer serialization #6087
  • [enhancement][bug][eager] fix bug about tensor str in nonsymmetric cast and getitem in consist… #6239
  • [enhancement][eager] Cpu all reduce #5849
  • [feature][eager] Support assign copy interface #6228
  • [enhancement][eager][api][interface] Dev reconstruct pad ops #6223
  • [enhancement][eager][api][interface] support flow.cuda.is_available #6124
  • [bug][eager] make flow._C.local_all_reduce launch synchronously #6175
  • [enhancement][eager] Rename flow to oneflow in user hint #6190
  • [bug][eager][tooling][test][api][interface] Autotest generate input tensor #6206
  • [enhancement][eager] consistent tensor zeros_() #6202
  • [enhancement][eager] Cpu mpi #5865

Build enhancements:

  • [bug][build] Fix GRPC compilation failure on CMake 3.20 #5255
  • [bug][build] Refine header file copy #5254
  • [bug][build] Fix older version CMake doesn't support multiple targets in CLI #5248
  • [bug][build] Turn off NCCL_STATIC/CUDNN_STATIC when CUDA_STATIC is OFF #5243
  • [feature][build] Fix support for Ninja and add Ninja build in Simple CI #5236
  • [enhancement][build] Add cmake option CUDA_STATIC #5164
  • [bug][build] Fix protobuf debug postfix #5233
  • [enhancement][ci][build] Move default third party dir into build dir #5230
  • [enhancement][build] Refine protobuf cmake #5216
  • [enhancement][ci][build] Remove transport test main #5215
  • [enhancement][ci][build] Speedup opencv build #5213
  • [enhancement][build] Support clang #5015
  • [enhancement][documentation][build] Add prefix when creating git archive #5201
  • [enhancement][build] Add cmake option NCCL_STATIC #5160
  • [enhancement][build] Refine CMake CUDA version handling #5192
  • [enhancement][build] Use clang plugin to check Maybe variables are used #5358
  • [enhancement][build] Add BUILD_BYPRODUCTS for ExternalProject_Add #5316
  • [enhancement][build] Add cmake init cache to simplify user onboarding #5311
  • [feature][bug][build] Fix macOS support and run macOS build in Simple CI #4947
  • [enhancement][build] flatbuffers use mirror #5295
  • [enhancement][build] Don't build test by default #5302
  • [enhancement][build] Prevent building from scratch when toggle flag BUILD_GIT_VERSION #5259
  • [enhancement][build] Refine gRPC, glog, gflags cmake for conda #5276
  • [feature][build] Support XLA with CPU-only #5260
  • [enhancement][ci][onnx][build] Remove ONNX from CI #5257
  • [enhancement][build] Refactor build_wheel to support oneflowinc images #5427
  • [enhancement][build] Add arg skip_audit in build wheel #5423
  • [bug][build] hwloc disable shared #5388
  • [documentation][build] Update readme for autoconf and libtool #5376
  • [enhancement][build] remove dir python and compatible_single_client_python #5609
  • [bug][build][system] Fix pyyaml version #5594
  • [enhancement][ci][build] force release flags #5574
  • [bug][build] prevent endless loop #5534
  • [enhancement][build] Support sccache #5528
  • [enhancement][build] Add definition for CMAKE_BUILD_TYPE and print cmake_build_type in oneflow doctor #5505
  • [enhancement][ci][build][need-simple-ci] Fix macOS for recent changes #5705
  • [bug][build] fix return type error on gcc 4.8.5 #5660
  • [enhancement][build] Check CMAKE_BUILD_TYPE #5656
  • [enhancement][build] add -Werror=return-type #5655
  • [enhancement][build] Clean and fix for new py dir #5618
  • [enhancement][build] cmake: disable array-bounds check & treat warnings as errors for pyextobj and oneflow_internal & fix warnings #5838
  • [bug][build] set CMAKE_BUILD_TYPE to Release if undefined #5842
  • [enhancement][build][need-simple-ci] Fix all warnings & Add option TREAT_WARNINGS_AS_ERRORS to cmake #5751
  • [enhancement][build] add CMAKE_INTERPROCEDURAL_OPTIMIZATION in fast cmake cache #5970
  • [enhancement][build] add clang tidy target #5957
  • [bug][build] cmake: fix cmake cache args in opencv #5959
  • [enhancement][build] Add cmake option USE_SYSTEM_NCCL #5897
  • [enhancement][build] cmake: include third party headers as system headers to avoid warnings #5879
  • [enhancement][build] Ignore opencv-python on machine aarch64 #5884
  • [enhancement][build] enable CMake first class cuda support #5858
  • [bug][build] Fix compile warning (strict-aliasing) #5872
  • [enhancement][bug][build][need-simple-ci] Upgrade gtest and fix some errors raised by clang #6079
  • [bug][ci][build] cmake: fix ninja build in CI #6072
  • [bug][build] fix files not actually removed when building for multiple python versions #6060
  • [bug][build][api] functional_api: fix build error in mac os #6010
  • [bug][build][need-simple-ci][need-single-client-tests] Fix recompile from scratch #6036
  • [bug][build] Turn on NVCC's warnings #6011
  • [bug][build][need-single-client-tests] fix bundle .so of other python version #6034
  • [bug][ci][build][need-single-client-tests] use copy_all_files_in_dir to replace copy_files #6033
  • [enhancement][build] check compiler version in cmake #6026
  • [enhancement][build] Add CUDA_NVCC_THREADS_NUMBER #6017
  • [enhancement][build][need-simple-ci] optimize of_include_copy #5978
  • [enhancement][ci][build][need-single-client-tests] CI: remove -DTREAT_WARNINGS_AS_ERRORS=OFF #6008
  • [enhancement][build][xla] xrt: fix all warnings #5915
  • [enhancement][build] Prevent opencv compile failure with std 17 #5997
  • [enhancement][build] Use bundled cub #5998
  • [enhancement][ci][build] update clang tidy diff warnings-as-errors option #5989
  • [enhancement][build] Update run_clang_tidy.py to set return code and add warning-as-errors #5977
  • [enhancement][build] check: fix clang-tidy-diff commands #5972
  • [bug][build] Suppress NVCC warning #177-D #6094

XLA enhancements:

  • [bug][xla] Make the blob header memory aligned. #5286

System:

  • [enhancement][system] Refactor Memory Zone #5072
  • [enhancement][system] Add interface InferContext::OutputTensorDesc #5219
  • [bug][system] Lazy construct functor to make sure that the operators have already been registered. #5225
  • [enhancement][system] Refactor infer ctx output isdynamic #5220
  • [enhancement][system] Refactor infer ctx input isdynamic #5211
  • [enhancement][system] Wake up the heartbeat thread immediately #5081
  • [enhancement][system] Fix xla test case fail #5203
  • [enhancement][system] Add interface InferContext::InputDType #5153
  • [purge][system] delete const_cast in Output #5196
  • [feature][system] Add hwloc for topology detection #5291
  • [enhancement][system] fix registry may segment #5336
  • [enhancement][system] Use functional api instead of op_expr_helper::XXXOp. #5364
  • [enhancement][system] move btob to op #5274
  • [documentation][system] Add Latest News section in README #5361
  • [enhancement][bug][system] fix dropout module: return directly if not training #5346
  • [bug][system] add missing JUST #5357
  • [documentation][system] Add more communication outlets on README #5359
  • [enhancement][feature][system] CommNet dynamic register memory #5281
  • [enhancement][system] Use symbol device #5341
  • [enhancement][system] fix multithread bug in env #5283
  • [bug][system][api] fix bug in cfg_replacement #5335
  • [bug][system] Fix create log directory thread-unsafe #5326
  • [bug][system] fix_bug_in_make_parallel #5328
  • [enhancement][system][cfg] replace train_conf, job_conf using cfg::xx #5263
  • [enhancement][system][quantization] support tensorrt in qat #5287
  • [enhancement][system][api] Export functional apis for oneflow.experimental. #5313
  • [enhancement][system] fix bug check between cfg enum and proto enum #5285
  • [enhancement][system] replace CHECK_EQ using CHECK_EQ_OR_RETURN #5279
  • [enhancement][system] Refactor SbpXXX to cfg::SbpXXX #5120
  • [enhancement][system][api] add detach for LazyMirroredtensorImpl #5270
  • [enhancement][system] shorten XXIsDynamic4ArgNameAndIndex to be xxIsDynamic #5265
  • [enhancement][system][cfg] job_config to cfg #5235
  • [feature][system] Multi-Client LogicalRun degenerate to PhysicalRun #5479
  • [enhancement][system] fix ConstructOp without JUST #5480
  • [enhancement][system] Output arg modifier return maybe part 1 #5451
  • [feature][system][interface] Fea/nn graph/graph build ctx #5420
  • [enhancement][system] Throw exception if check failed #5457
  • [feature][system] multi client launch #5372
  • [enhancement][system][api] Optimize reduce mean #5452
  • [enhancement][system] export Tensor only to python #5440
  • [enhancement][system] Output arg modifier return maybe part_0 #5447
  • [enhancement][system] ThreadMgr support AddPlan #5450
  • [enhancement][system] Refactor infer ctx input tensordesc #5226
  • [enhancement][system][api] instruction builder return maybe #5442
  • [feature][system][interface] MultiClientSessionContext #5421
  • [enhancement][feature][system] add launcher, update multi client launch and exit #5414
  • [purge][system][refactor] Remove IOConf #5419
  • [enhancement][system] Dev refine generator #5426
  • [enhancement][system] Support inplace operations #5204
  • [enhancement][system][refactor] Dev refactor generator #5397
  • [enhancement][system] Add new placement init func #5408
  • [enhancement][system] NNGraphIf #5387
  • [enhancement][system][refactor] Cast explicitly in unpack call to avoid conflict with Optional. #5380
  • [enhancement][system][interface] [Random Generator] Part2: Migrate functional dropout #5378
  • [enhancement][system] replace ForeignJobInstance using JobInstance #5374
  • [enhancement][system][refactor] Speedup reshape module by 5x. #5381
  • [feature][system][interface] [Random Generator] Part1: Dev random generator #5360
  • [enhancement][system] Add ONEFLOW_STREAM_CUDA_EVENT_FLAG_BLOCKING_SYNC #5612
  • [enhancement][system] [part2]Remove singleclient outdated api #5568
  • [feature][system][interface] nn.Graph call and launch impl #5580
  • [enhancement][system] remove outdated doctest api and "@experimental_api" #5564
  • [feature][system][interface] Register ForeignCallback and Watcher in Multi-Client #5591
  • [feature][system][api] Add tensor yaml, support export tensor functional api. #6099
  • [feature][system] Plan memory log #6151
  • [feature][system] Add dtype bfloat16 #5304
  • [enhancement][system] StreamContext #6129
  • [bug][system] Fix wrong inplace acc grad #6146
  • [enhancement][system] UserKernel remove job_desc #6144
  • [enhancement][system][api] Fea/graph/add outputs buffer to enable pipeline #6126
  • [enhancement][system] not fuse request for nccl 2.10.3 #6136
  • [bug][system] NewUniqueId thread safe #6141
  • [enhancement][system] XRT remove job_desc #6139
  • [enhancement][system] SystemOpFillJobNamePass #6138
  • [enhancement][system] mv_boxing_folder_to_core #6140
  • [enhancement][system] Refactor boxing interpreter to boxing expr #6134
  • [enhancement][system] Eager boxing one to one #6048
  • [enhancement][system] Vm cpu efficiency #6110
  • [enhancement][system] Naive generic boxing #6116
  • [feature][system] send/recv #5992
  • [enhancement][system] disable_print_stack_in_tensor_numpy #6123
  • [feature][system] add all_reduce by to_consistent #5963
  • [enhancement][system] KernelContext #6084
  • [enhancement][bug][system] Fix sync nccl and async nccl deadlock #6071
  • [bug][system][refactor] Refactor to local #6098
  • [enhancement][system] Replace xor with hash combine (part 1) #6078
  • [enhancement][system] Optimize error message #6073
  • [enhancement][system] Rename Error::xx to Error::xxError #6049
  • [enhancement][system] send formatted msg to glog #5999
  • [feature][bottleneck][bug][system][interface] [Feat.] NNGraph new eager tensor for new variable created in JobPass #6091
  • [bug][system] Fix bug of multi-GPU eager copy D2H extra mem cost in rank 0 #6092
  • [enhancement][system][api] Rename module flow.F to flow._C #6053
  • [feature][system][interface] [Feat.] Eager consistent OFRecordReader #6089
  • [enhancement][system][api] Dev fix and align interface #6075
  • [feature][bottleneck][bug][system][interface] NNGraph input/output valid by register tensors #6240
  • [bug][system][interface] Fix bug of Multi-Client src tick output order #6221
  • [enhancement][bug][system] Add cast primitive #6234
  • [feature][bottleneck][system][interface] Auto FixPipelineStageIdPass #6204
  • [enhancement][system] move scalar to oneflow namespace. #6235
  • [enhancement][system] UserKernel init CUDA Graphs with state #6230
  • [feature][system] Comm broadcast #6213
  • [enhancement][system][refactor] Rename opname to optype_name in AutogradEngine #6154
  • [enhancement][system] Add memset primitive #6218
  • [enhancement][system] Add StreamContext::device_type()/DeviceCtx::device_type() #6217
  • [feature][system] add all_gather and fix bug of multi rank doctest #6189
  • [feature][system][interface] [Feat.] Lazy interpreter skip hierarchical_parallel_cast #6208
  • [purge][system] Cleanup KernelUtil #6212
  • [enhancement][system] StreamContextAdapter #6205
  • [enhancement][system] Dev eliminate gcc warnings #6199
  • [feature][bottleneck][system][interface] [Feat.] nn.Graph support grad acc with input/output tensor #6155
  • [enhancement][system] Cpu symetric s to s #6153
  • [enhancement][system][upload-core] Op expr infer tensor meta #5064
  • [enhancement][system] Infer consistent tensor meta #5362

CI enhancements:

  • [bug][ci][api][interface] Refine module test #5232
  • [enhancement][ci] Add Simple CI, runs CPU-only on GitHub hosted servers #5207
  • [enhancement][ci] Run exe test in CPU-only #5202
  • [enhancement][ci] Cancel all workflow runs but the latest #5206
  • [enhancement][ci] Fix master not running Simple CI #5368
  • [enhancement][ci] Refine Simple CI and Clang analysis #5367
  • [enhancement][feature][bug][ci][documentation][interface] Fix upsample bilinear bug #5363
  • [enhancement][ci] Build nightly for py39 #5318
  • [enhancement][ci] Try distributed run for 3 times to prevent failure #5305
  • [enhancement][ci] Upload Simple CI logs to cloud #5268
  • [enhancement][ci] Remove cpu_op_eager and cuda_op_eager #5470
  • [bug][ci] fix segfault in clang plugin #5437
  • [enhancement][ci] Refine Simple CI error output #5435
  • [enhancement][ci] Add conda env to Simple CI #5385
  • [enhancement][ci] Fix clang plugin core file not found #5390
  • [bug][ci] upload core when build with clang plugin #5384
  • [bug][ci] clang plugin skip more files #5373
  • [enhancement][ci] Use gh-action-scheduler-v2 #5370
  • [enhancement][ci] relax speed threshold #5569
  • [bug][ci] Fix wrong test path under compatible #5567
  • [enhancement][ci][need-simple-ci] Prevent upload logs automatically #5560
  • [enhancement][ci][interface] Add nn.AdaptiveAvgPool1d and nn.AdaptiveAvgPool3d #5445
  • [feature][ci] add speed test in ci #5496
  • [enhancement][ci] Reduce usage of Simple CI #5546
  • [feature][bug][ci][api] Restruct upsample module #5524
  • [feature][ci] multi client launcher test #5488
  • [enhancement][ci] Remove automerge if cuda_new_interface failed #5519
  • [enhancement][ci] Prevent adding subdir in python/test #5514
  • [enhancement][ci] piprepo->pipindex #5517
  • [enhancement][ci] add dynamic_loss_scale in ci tests #5337
  • [enhancement][ci] Add timeout for wait_gpu_slot #5497
  • [enhancement][feature][ci] new static check based on clang-tidy #5476
  • [enhancement][ci] Fix url not downloadable in some browsers #5701
  • [feature][ci] multi client multi machine test #5685
  • [enhancement][ci] Add cpu new interface CI #5639
  • [enhancement][ci][need-simple-ci] Mv clangtidy to simple ci #5667
  • [enhancement][ci][need-simple-ci] use clang tidy appimage in ci #5841
  • [enhancement][ci] Use gcc 7 in release to prevent error #5840
  • [enhancement][ci] bn tol 1e-4 => 1e-3 #5811
  • [enhancement][ci] fix distributed run on built dir #5810
  • [enhancement][ci] fix third party mirror check_sum #5802
  • [ci][documentation] find more accurately which files need to be doctested #5782
  • [enhancement][ci] Print stack unconditionally #5779
  • [enhancement][ci][need-simple-ci] Enable more checkers for clang-tidy in CI #5738
  • [enhancement][ci] CI: add clang-tidy check to test.yaml #5920
  • [ci][documentation] fix docstring in oneflow.nn.functional namespace #5807
  • [enhancement][ci] disable TREAT_WARNINGS_AS_ERRORS in Release CI #5886
  • [enhancement][ci] Skip ci jobs by git diff #5863
  • [bug][ci] quick fix #5978 #6030
  • [enhancement][bug][ci] fix clang tidy diff options and file format #5990
  • [enhancement][ci] add flow.relu #5847
  • [enhancement][ci] equal => allclose #6164
  • [bug][ci][need-simple-ci] CI: fix clang tidy checks in simple ci #6161
  • [enhancement][bug][ci][documentation][api] add interpolate and layer_norm docs #6157
  • [bug][ci] update speed test #6113
  • [enhancement][bug][ci][documentation][api] speed import oneflow #6107
  • [bug][ci] Also try install dev deps for CODEGEN_PYTHON_EXECUTABLE #6115
  • [bug][ci][need-simple-ci] set gtest_CMAKE_DEBUG_POSTFIX "d" #6085
  • [enhancement][ci] add cache init file for clang and CI build with clang #6062
  • [enhancement][ci] add emoji in speed test output, make it continue-on-error #6214

Test enhancements:

  • [bug][test][interface] Fix acos ci bug #5217
  • [feature][test] implement automated test #5321
  • [enhancement][test] move generator test into ops folder to accelerate tests #5472
  • [feature][test][api] Add autotest part2 #5467
  • [enhancement][test][api][interface] Add some tests with the new framework for auto testing #5561
  • [bug][test] fix test error when do multi case test on graph #5590
  • [enhancement][test] Refine module test using auto test by yaochi #5484
  • [enhancement][test] Add autotest for BatchNorm2d #5734
  • [enhancement][test] RTH_update_op_test #5823
  • [enhancement][test] dev adamw graph config #5745
  • [feature][test][api][interface] Add new autotest #5562
  • [bug][test] restore test of alexnet graph #5798
  • [enhancement][test][interface] add zhangshen op-test #5600
  • [feature][bug][tooling][test][interface] Record autotest wrong code #5923
  • [enhancement][feature][test][api] add randint #5718
  • [bug][test] fix multi machine test #5984
  • [enhancement][test][interface] some op test #6095

Tooling enhancements:

  • [bug][tooling] user/summary: fix memory leak in FillImageInSummary #5742
  • [enhancement][tooling][cfg] cfg: add move assignment operator for performance #5962
  • [enhancement][tooling][api][refactor] refactor_all_device_placement_api #6080
oneflow - v0.4.0

Published by jackalcooper over 3 years ago

Changelog v0.4.0

Highlights

This release adds a large number of features to OneFlow; 0.4.0 is the biggest update since OneFlow was open-sourced. It introduces 2-D SBP, pipeline parallelism, a new checkpointing interface, a large set of PyTorch-aligned interfaces, and support for CUDA 11.2. We have previously open-sourced OneFlow's GPT code, which makes heavy use of the new features in this release; you are also welcome to read the article "OneFlow: giving every algorithm engineer the ability to train GPT".

Feature updates in Lazy mode

Support for 2-D SBP

  • Cast to 2-D:
    with flow.scope.placement("gpu", "0:0-3", (2, 2)):
        x = flow.hierarchical_parallel_cast(
            x, parallel_distribution=["B", "S(1)"]
        )

  • Cast to 1-D:
    with flow.scope.placement("gpu", "0:0-3", (4,)):
        x = flow.hierarchical_parallel_cast(
            x, parallel_distribution=["S(0)"]
        )
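
  A minimal end-to-end sketch combining the two casts above (the job body, shapes, and variable names are illustrative assumptions, not part of this release note):

    import oneflow as flow
    import oneflow.typing as tp

    @flow.global_function()
    def matmul_2d_sbp_job(x: tp.Numpy.Placeholder((64, 1024))) -> tp.Numpy:
        # Four GPUs arranged as a 2 x 2 device hierarchy.
        with flow.scope.placement("gpu", "0:0-3", (2, 2)):
            # Broadcast along the first hierarchy axis, split along dim 1 on the second.
            x = flow.hierarchical_parallel_cast(
                x, parallel_distribution=["B", "S(1)"]
            )
            w = flow.get_variable(
                "w", shape=(1024, 1024),
                initializer=flow.random_normal_initializer(),
            )
            y = flow.matmul(x, w)
        # Cast back to a flat 1-D placement before returning.
        with flow.scope.placement("gpu", "0:0-3", (4,)):
            y = flow.hierarchical_parallel_cast(y, parallel_distribution=["S(0)"])
        return y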
    

New interfaces for pipeline parallelism

  • Create a scope with a pipeline stage id:

    with flow.experimental.scope.config(
        pipeline_stage_id_hint=dist_util.get_layer_stage(layer_idx)
    ):
        ...

  • For pipeline parallelism to work well, gradient accumulation must be used; it lets larger batches run within limited memory. Set the number of gradient accumulation steps via the function config (a combined sketch follows below):

    func_cfg = flow.FunctionConfig()
    ...
    func_cfg.train.num_gradient_accumulation_steps(args.num_accumulation_steps)
    @flow.global_function(..., function_config=func_cfg)
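
  Putting the stage scopes and gradient accumulation together, a hedged two-stage sketch (the stage ids, layer names, shapes, and hyperparameters are illustrative assumptions, not from this release note):

    import oneflow as flow
    import oneflow.typing as tp

    func_cfg = flow.FunctionConfig()
    # Run 8 micro-batches per optimizer step so both stages stay busy.
    func_cfg.train.num_gradient_accumulation_steps(8)

    @flow.global_function(type="train", function_config=func_cfg)
    def two_stage_job(
        images: tp.Numpy.Placeholder((32, 1024)),
        labels: tp.Numpy.Placeholder((32,), dtype=flow.int32),
    ) -> tp.Numpy:
        # Stage 0 runs on GPU 0.
        with flow.experimental.scope.config(pipeline_stage_id_hint=0):
            with flow.scope.placement("gpu", "0:0"):
                h = flow.layers.dense(images, 1024, activation=flow.nn.relu, name="fc0")
        # Stage 1 runs on GPU 1.
        with flow.experimental.scope.config(pipeline_stage_id_hint=1):
            with flow.scope.placement("gpu", "0:1"):
                logits = flow.layers.dense(h, 10, name="fc1")
                loss = flow.math.reduce_mean(
                    flow.nn.sparse_softmax_cross_entropy_with_logits(
                        labels=labels, logits=logits
                    )
                )
        flow.optimizer.SGD(
            flow.optimizer.PiecewiseConstantScheduler([], [0.1]), momentum=0
        ).minimize(loss)
        return loss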

Support for ZeRO optimization

  • How to enable it:

    func_cfg = flow.FunctionConfig()
    ...
    func_cfg.optimizer_placement_optimization_mode(mode)  # mode = "non_distributed" or "distributed_split"
    @flow.global_function(..., function_config=func_cfg)

  • For example code, please refer to this test case
  • mode = "distributed_split" corresponds to stage 2 of the DeepSpeed ZeRO optimization

New interface for activation checkpointing

with flow.experimental.scope.config(
    checkpointing=True
):
    ...  # build the layers whose activations should be recomputed here
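
For example, a hedged lazy-mode fragment (the layer names and sizes are illustrative assumptions) that recomputes the activations of a two-layer dense block:

    def dense_block(x):
        # Activations produced under this scope are dropped after the forward
        # pass and recomputed during the backward pass, trading compute for memory.
        with flow.experimental.scope.config(checkpointing=True):
            x = flow.layers.dense(x, 4096, activation=flow.nn.relu, name="fc_a")
            x = flow.layers.dense(x, 4096, activation=flow.nn.relu, name="fc_b")
        return x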

You are welcome to read the related article: "Sublinear memory optimization: implementing activation checkpointing in OneFlow".

Feature updates in Eager mode

Provides the oneflow.experimental namespace, partially aligned with the torch.xxx interfaces

  • How to use the new interfaces

    import oneflow.experimental as flow
    flow.enable_eager_execution()  # enable eager execution

  • Interfaces currently aligned

    flow.nn.Conv2d  <->  torch.nn.Conv2d
    flow.nn.BatchNorm2d  <->  torch.nn.BatchNorm2d
    flow.nn.ReLU  <->  torch.nn.ReLU
    flow.nn.MaxPool2d  <->  torch.nn.MaxPool2d
    flow.nn.AvgPool2d  <->  torch.nn.AvgPool2d
    flow.nn.Linear  <->  torch.nn.Linear
    flow.nn.CrossEntropyLoss  <->  torch.nn.CrossEntropyLoss
    flow.nn.Sequential  <->  torch.nn.Sequential
    
    flow.nn.Module.to  <->  torch.nn.Module.to
    flow.nn.Module.state_dict  <->  torch.nn.Module.state_dict
    flow.nn.Module.load_state_dict  <->  torch.nn.Module.load_state_dict
    
    flow.save  <->  torch.save
    flow.load  <->  torch.load
    
    flow.Tensor  <->  torch.Tensor
    flow.tensor  <->  torch.tensor
    flow.tensor.to  <->  torch.tensor.to
    flow.tensor.numpy  <->  torch.tensor.numpy
    flow.tensor arithmetic (+ - * /)  <->  torch.tensor arithmetic (+ - * /)
    flow.tensor.flatten  <->  torch.tensor.flatten
    flow.tensor.softmax  <->  torch.tensor.softmax
    
    flow.optim.SGD  <->  torch.optim.SGD
    

    With the modules above, common networks such as ResNet, BERT, and MobileNetV3 can already be built with ease. Later releases will align with and support more interfaces, at which point most networks built with PyTorch can be switched to OneFlow with little effort.
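
    As a minimal illustration, the aligned modules compose just like their torch counterparts; the shapes and values below are made up for the sketch, and exact argument coverage in 0.4.0 may differ:

        import numpy as np
        import oneflow.experimental as flow

        flow.enable_eager_execution()

        # A tiny CNN built only from the aligned modules listed above.
        model = flow.nn.Sequential(
            flow.nn.Conv2d(1, 8, kernel_size=3, padding=1),
            flow.nn.BatchNorm2d(8),
            flow.nn.ReLU(),
            flow.nn.MaxPool2d(2),
        )

        x = flow.tensor(np.random.randn(4, 1, 28, 28).astype(np.float32))
        y = model(x)
        print(y.shape)  # (4, 8, 14, 14)

        # Saving parameters mirrors torch as well.
        flow.save(model.state_dict(), "./ckpt")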

  • Quick-start LeNet example: https://github.com/Oneflow-Inc/models/blob/main/quick_start_demo_lenet/lenet.py

  • Documentation for the new interfaces: https://oneflow.readthedocs.io/en/master/experimental.html

  • ResNet50 example code aligned with torchvision: https://github.com/Oneflow-Inc/models/tree/main/resnet50

  • The next few releases will add more PyTorch-aligned interfaces

  • The aligned interfaces under experimental will be moved into the oneflow namespace in the 0.6.0 release, at which point they will fully align with PyTorch; OneFlow 0.6.0 will also make eager the default execution mode

  • Eager mode currently only supports single-GPU execution; multi-GPU support will come in 0.5.0

Other updates

New Python pip package naming and versioning scheme

Previous OneFlow releases followed a "different package name, same version string" scheme, e.g. oneflow_cu102==0.3.4. Starting from 0.4.0, releases follow a "same package name, different version string" scheme, e.g. oneflow==0.4.0+cu102. For the latest installation instructions, see the "Install with Pip Package" section of the README.

Support for CUDA 11.2

Both the stable and nightly builds of OneFlow now support the CUDA 11.2 platform (cu112).

ONNX module moved to a standalone repository

The ONNX module is now maintained in the new repository https://github.com/Oneflow-Inc/oneflow_convert_tools, and the ONNX-related code in the main OneFlow repository will be removed in the next release; for details, see the article "How does the deep learning framework OneFlow interact with ONNX?". oneflow_convert_tools currently targets OneFlow's lazy mode, and its latest version is v0.3.2; versions targeting eager mode will start from 0.4.0.

"Coming next"

The next OneFlow release will bring more comprehensive PyTorch compatibility, including richer interface support and multi-GPU support. It will also support converting between dynamic and static graphs. Stay tuned!

oneflow - Hotfix v0.3.4

Published by jackalcooper almost 4 years ago

oneflow - v0.3.3

Published by jackalcooper almost 4 years ago

Op fixes and performance optimizations

  • [enhancement][op] reduce sum half kernel #4110
  • [enhancement][op] simplify cosface #4107
  • [enhancement][op] indexed_slices update support weight_decay #4096
  • [enhancement][op][python] Migrate swish and mish namespace from math to nn #4104
  • [enhancement][op] Add elementwise maximum/minimum ops #4069
  • [enhancement][op] Fix Code format warning in hardswish #4105
  • [enhancement][feature][op] Add Scalar Pow #4082
  • [bug][op] Fix bug: mut_shape_view of static output maybe null in UserKernel::ForwardShape #4094
  • [enhancement][op][refactor] Migrate cast_to_static_shape to user op #4095
  • [feature][op] Add GroupNorm op #4089
  • [feature][op] Distributed partial sampler #3857
  • [enhancement][op][python] add Relu6 activation #4029
  • [bug][op] Rename ont_hot_op.cpp to one_hot_op.cpp #4093
  • [bug][op][python] fix hardtanh CI precision error #4091
  • [enhancement][op] add remove_img_without_anno api for COCOReader #4088
  • [enhancement][op] Add Hardtanh activation #4049
  • [enhancement][op] Add ELU activation #4065
  • [enhancement][op][python] Update logsoftmax.py #4041
  • [documentation][op] Fix in_top_k api document #4079
  • [enhancement][op] Add Hardswish activation #4059
  • [enhancement][op][python] Add hard sigmoid #4043
  • [enhancement][op] Dev in top k #3611
  • [bug][op] Fix argwhere tmp buffer infer #4061
  • [enhancement][op] Optimize softmax cuda kernel #4058
  • [feature][op] Add InstanceNorm 1d & 3d implementation #4052
  • [feature][op] Quantization aware training related ops #3764
  • [enhancement][op] Generic unfold kernel implementation #4033
  • [enhancement][op] User op dim_gather support dynamic input and index #4039
  • [enhancement][op] Reflection pad2d op #3777
  • [enhancement][op] slice support empty blob #4025
  • [bug][enhancement][op] Migrate argwhere to user op #4021
  • [bug][op] Dev rm old tanh #4035
  • [enhancement][op][refactor] Make MaxWithLogThreshold and SafeLog header only #4030
  • [op][purge] Tidy up op_conf.proto #3932
  • [enhancement][op][python] Dev bcewithlogits loss #4024
  • [feature][op] Add implementation of InstanceNorm2D op #4020
  • [enhancement][op][refactor] Refactor gpu_atomic_add #4027
  • [enhancement][op][python] add kldivloss #4012
  • [enhancement][op][python] Dev oneflow ones #3990
  • [enhancement][op] Add flatten/squeeze/expand_dims to auto mixed precision clear list and use reshape instead of reshape_like to do reshape grad computation #4015
  • [enhancement][op][python] add pixel shuffle #4003
  • [enhancement][op] Scalar kernels use element-wise template #4013
  • [enhancement][op][python] add zeros api #3991
  • [enhancement][op] Optimize ComputeEntropyGpu with CUB #3930
  • [feature][op] CUDA template for element-wise kernels #4007

System components

  • [enhancement][system] migrate job_build_and_infer api to pybind11 #3940
  • [feature][system] quantization aware training pass #3817
  • [eager][enhancement][system] Mig op arg para attr #4102
  • [feature][system] Tensor Float 32 Support. #4072
  • [enhancement][system] Mig op arg para attr #4090
  • [enhancement][system] Mig py cfg sbp #4086
  • [enhancement][system] Refactor python remote blob #4081
  • [enhancement][system] remove BlobDef #4071
  • [bug][system] Fix warning: moving a local object in a return statement prevents copy elision #4067
  • [enhancement][system] Refactor python blob desc #4063
  • [feature][system] Add nvtx range and thread naming #4064
  • [documentation][enhancement][system] Add docs on installing legacy versions of oneflow #4056
  • [bug][system] support eager empty blob #4047
  • [enhancement][system] Add err info for ncclGroupEnd check #4048
  • [enhancement][system] Optimize dynamic loss scale parameters #4045
  • [purge][system] Remove col_id #4046
  • [enhancement][system] Scope with symbol #4040
  • [enhancement][system] Job desc with symbol #4032
  • [enhancement][system] Parallel desc with symbol #4017
  • [bug][system] change sbp order value for layer norm #3995
  • [bug][system] Fix eager test_resume_training test #4023
  • [bug][system] Fix python cfg error bug #4018
  • [bug][system] Remove redundant pack_size in GenericLauncher #4014
  • [enhancement][system] Set default block size to 512 #4011
  • [feature][system] Remove swig in oneflow #3969
  • [feature][system] Migrate oneflow internal api to pybind11 #3953
  • [build][enhancement][system] Bump nccl from 2.7.3 to 2.8.3 #3875

Eager mode

  • [bug][eager] Fix eager bug of test split like #4004
  • [bug][eager] add float16 datatype for eager boxing #4092

Python frontend

  • [feature][python] add stack #3897
  • [bug][enhancement][python] Fix test kldivloss tolerance #4103
  • [bug][enhancement][python] Fix "hardsigmoid" eager test error #4085
  • [bug][documentation][python] Add hardsigmoid #4076
  • [api][enhancement][python] add deprecate api optimizer.PolynomialSchduler #4038

Toolchain

  • [feature][tooling] split_cfg_cpp_and_pybind_generator #4002
  • [enhancement][tooling] Cfg hash #4084
  • [enhancement][tooling] Finetune cfg tool #4050
  • [enhancement][tooling] optimize link time #4042

Build

  • [build][documentation] Add CentOS specific info on README.md #4099
  • [build][enhancement] Disable CUDA_SEPARABLE_COMPILATION #4036

CI

  • [bug][ci] Quit docker after making ssh credential #4075
  • [bug][ci] Fix CI outputs wrong cmd when printing failed cmd due to shadowed var #4031
  • [ci][enhancement] Upload log of distributed CI #4028
  • [ci][enhancement] Make oneflow worker docker stay alive for 6 hours #4026
  • [ci][enhancement] Allow to keep oneflow_worker log in distributed CI #4022
  • [ci][documentation] userop and general pr templates added #3952
oneflow - v0.3.2

Published by jackalcooper almost 4 years ago

Changelog

v0.3.2 (16/12/2020)

  • [enhancement][system] Migrate foreigns to pybind11 #3939
  • [feature][op][python] add swish activation #3970
  • [bug][op] fix argwhere format #4010
  • [enhancement][op] Argwhere support empty blob #4009
  • [feature][op][python] add mish activation #3972
  • [bug][eager] Fix eager memory leak and re-enable new checkpoint #4008
  • [ci][enhancement] upload bin to oss #4000
  • [enhancement][op] Fuse cast scale #3999
  • [enhancement][op] layer_norm_grad_add_to_output #3998
  • [enhancement][system] Optimize NcclCollectiveBoxingExecutorBackend::ExecuteGroup latency #3997
  • [feature][system] OptimizerPlacementOptimization #3944
  • [enhancement][op] Dev optimize prelu #3987
  • [api][enhancement][op] Switch identity to user op and add it to auto mixed precision clear list #3992
  • [enhancement][op] Optimize slice kernel #3989
  • [bug][op] Hotfix: add parallel cast to amp clear list #3988
  • [bottleneck][enhancement][system] Sublinear memory cost by checkpointing #3976
  • [enhancement][system] Add gradients stats aggregation #3979
  • [feature][system] nccl enable mixed fusion #3981
  • [enhancement][op] fused_scale_tril / hot fix matmul / softmax broadcast_sub broadcast_div #3980
  • [bug][op] add combined margin cpu and fix bug #3961
  • [feature][op] Add multi_square_sum op #3977
  • [bug][op] fix pad op #3971
  • [ci][enhancement][test] larger tol for bn #3965
  • [cfg][enhancement][python] Dev replace py job conf proto to cfg #3856
  • [enhancement][refactor][ssp] Dev ssp fix fuse and add just #3959
  • [cfg][enhancement][refactor][tooling] replace ScopeProto to cfg #3816
  • [feature][op] TripOp add fill value #3960
  • [enhancement][system] remove serialized in python callback #3891
oneflow - v0.3.1

Published by jackalcooper almost 4 years ago

Changelog

v0.3.1 (02/12/2020)

  • [bug][system] Fix CollectiveBoxingGenericTaskNode::ProduceAllRegstsAndBindEdges #3946
  • [bug][op] Fix constant init value #3947
  • [api][enhancement][refactor][tooling] Refine custom op build #3925
  • [feature][op] add combined margin loss #3819
  • [enhancement][tooling] default show cpp error stack frame #3948
  • [cfg][enhancement][tooling] Dev replace py parallel conf proto to cfg #3810
  • [feature][system] Add NaiveB2PSubTskGphBuilder #3942
  • [bug][system] disable new checkpoint by default temporarily #3943
  • [bug][system] Explicitly specify the SBP in NonDistributedOptimizerPass #3937
  • [bug][op] indexed_slices_model_update handle empty tensor #3933
  • [bug][ci] fix oss list file 100 limit #3935
oneflow - v0.2.0

Published by jackalcooper about 4 years ago

Changelog

v0.2.0 (09/10/2020)

Op fixes and performance optimizations

Support fusing binary add ops with their predecessor nodes

  • FuseAddToOutput #3524
  • Dropout support add_to_output #3569
  • Dev matmul add to output #3581

Kernel performance optimizations

  • Fused BatchNormAddRelu #3519
  • bn_add_relu use bit mask #3645
  • layer_norm param grad #3604
  • Fused layer norm #3591
  • BiasAdd Row Col Half2 #3636
  • MaskAndScaleHalf2 #3643
  • Optimize CudaAsyncMemoryCopier #3543
  • Avoid using local memory in CropMirrorNormalizeGpuKernel #3539
  • LayerNormGpuKernel use fused InstanceScaleCenter #3573

Model update ops reimplemented as user ops, with fusion support

  • Add model update user ops #3546
  • Migrate L1L2RegularizeGradientOp to UserOp Framework #3527
  • model update fuse scalar_mul_by_tensor #3635
  • Dev indexed slices model update user ops #3561
  • Dev adam xla and rm sys op #3584

NCCL supports setting the maximum number of fused ops

  • Add nccl_fusion_max_ops #3567

New ops

  • [feature] Fused ImageDecoderRandomCropResize #3644
  • Add AmpWhiteIdentityOp #3658
  • Add ImageDecoderRandomCropResizeOp::InferParallelSignature #3646
  • Dev add op tril #3511
  • add masked fill op #3515

Global cache for cuDNN algorithm selection

  • Add CudnnConvAlgoCache #3649

Bugfixes and miscellaneous

  • fix broadcast div grad #3525
  • fix optimizer copy-paste bug #3508
  • fix bug about pad value #3640
  • Optimize some default values #3648
  • Fix cuda runtime #3621
  • Fix reshape inplace #3545
  • Refactor rmsprop mean_square and add unit tests for optimizers #3523
  • Remove cuDNN fields from OperatorConf #3536
  • Add UserOpConfWrapperBuilder::ScopeSymbolId #3528
  • Fix NcclCollectiveBoxing builder_name #3563
  • rm conv2d cpu testcase #3574
  • fix broadcast_to_compatible_with grad bug #3609
  • Add inline for half #3600
  • Fix converter half #3599
  • Fix gpu_atomic_max double overload use fmaxf #3578
  • fix upsample #3579

Eager Execution

Added more comments to the eager-related code; fine-tuned the stateless_call instruction to distinguish two kinds of arguments, mutable_input and output; implemented the broadcast instruction.
  • fix fmt cuda_copy_d2h_stream_type #3606
  • add comments for cuda_copy_d2h_stream_type.cpp #3603
  • Fix TopoForEachNode in GenCollectiveBoxingPlan #3566
  • Split call_op_kernel instruction args into const_input/mutable_input/output #3562
  • split BlobObject and EagerBlobObject #3485
  • remove unused code under vm/ #3585
  • Dev broadcast instruction #3555
  • Broadcast instruction #3552

pybind11 integration

SWIG and pybind11 now coexist inside OneFlow; the codebase will gradually switch over to pybind11.
  • pybind11 integration #3517
  • upgrad to pybind11 master and pass exe path #3522
  • Update rel script for pybind11 #3526
  • Dev oneflow pybind api #3625

Build tooling fixes and improvements

Fixed some misconfigurations that caused builds to fail or run slowly, sped up dependency downloads, and fixed the Ubuntu Dockerfile.
  • [bug] fix ubuntu docker build #3504
  • change link order to fix the cpu+openblas build #3634
  • [bug] fix bug: oneflow cpu-only lib flags #3615
  • add convert_url_to_oss_https_url and DCN flag #3595
  • Add cn url in readme #3583
  • make absl use tar not git #3570
  • Optimize nvcc gencode flag #3577

Transport: network transfer subsystem

Supports P2P dynamic network transfer.
  • [feature] Transport #3549

CFG tool integration

CFG is a tool, based on proto syntax, that generates code for exchanging data between Python and C++.
  • Dev integrate cfg #3597
  • Less usage of PbMessage in Operator #3651

XLA support improvements

Upgraded to the latest TF version.
  • upgrade XRT XLA to TF 2.3.0 #3531
  • Fix XLA crash #3548

gRPC upgrade

Upgraded to the latest gRPC version.
  • Upgrade grpc #3551
  • [bug] [bugfix] GRPC: control server CompletionQueue shutdown. #3589

CI and test improvements

Added XLA to CI, improved the op test cases, and automatically upload builds of the latest master commit.
  • Parallel unit tests (Step 1, refactor existing unit tests) #3632
  • Add build type for pr oss upload #3627
  • XLA ci support #3564
  • Auto upload tar to aliyun oss #3592
  • Don't pack source code if it is not master #3593
  • move fmt to github hosted #3559
  • refactor ci #3557
  • CtrlTest find available port for ctrl port instead of handwriting #3610

ONNX support

Improved the IR and updated the test scripts.

Documentation additions and revisions

Python frontend fixes

  • Fix the bug of using op_module_builder in namespace scope #3513
  • Comment release global for now to avoid random crash in python #3629
  • update lib name in link flags #3623
  • rm spaces in rm_spaces optimizer.py #3619

Improvements and fixes to common system components

  • [enhancement] flat ErrorProto error_type #3474
  • [enhancement] Added user_op_conf getter for BatchAxisContext/KernelInitContext/SbpContext #3506
  • [bug] Fix UserOpConfWrapper::has_input/has_output #3507
  • support reflecting cfg message #3655
  • Refactor scope #3652
  • Refactor placement scope #3650
  • Bugfix split config proto and session job set #3637
  • [Bug fix] Release global variables #3624
  • Add OpRegistry::SetAreaId #3608
  • Dev converter #3580
  • Tensor::dptr support half #3582
  • Use InferOutBlobDescsIf instead of InferBlobDescsIf in InferOpNodeLogicalBlobDesc #3535
  • Add ctrl_in_op_name only when unreachable #3537
oneflow -

Published by jackalcooper about 4 years ago

oneflow - version 0.2b1

Published by jackalcooper about 4 years ago

oneflow - 0.2 beta 0

Published by jackalcooper about 4 years ago

oneflow - 0.1.11 beta1

Published by jackalcooper about 4 years ago
