OneFlow v1.0.0 has been released. Welcome to install the new version for a better experience.
This release includes 447 commits and the following highlights:
Released a new interface `compile_from_torch`. This interface converts a PyTorch Module instance into a OneFlow Module instance while sharing the parameter memory. It supports direct Eager execution or conversion into a static graph nn.Graph, further accelerated by MLIR compilation. The interface is evolving rapidly and currently supports dynamic-shape compilation; it has been validated on typical models such as ResNet50, Faster RCNN, and Stable Diffusion.
Made a series of optimizations and refactorings to the Eager execution runtime, including unification of system memory pools, integration with CUDA native interfaces, optimization of the instruction scheduling mechanism, introduction of an instruction fusion mechanism, faster Autograd graph construction, optimization of the op inference process, and decoupling of Instruction and Stream.
The static graph distributed physical execution plan supports separate compilation functionality, allowing each process to independently compile its required execution plan, eliminating linear growth of compilation time with GPU scale.
Added a series of functional automatic differentiation interfaces, including jvp, vjp, hvp, vhp, jacobian, and hessian.
Added the Insight module, supporting visualization of kernel invocation, execution time, speed, and other information within instrumented (tracepoint) intervals.
Updated LiBai (the open-source toolbox for large-scale model training) with native support for fine-tuning and distributed inference of Llama2 and ChatGLM2, covering full fine-tuning, adapter fine-tuning, and LoRA fine-tuning; lm-eval-harness can be used for language model evaluation and validation.
Upgraded OneFlow Serving to support the OneFlow Python backend and the OneFlow Lite backend, in addition to the existing OneFlow Cpp backend.
The `compile_from_torch` interface converts a PyTorch Module instance into a OneFlow Module instance while sharing the parameter memory. It supports direct Eager execution or conversion into a static graph nn.Graph, further accelerated by MLIR compilation. (https://github.com/Oneflow-Inc/oneflow/pull/10404, https://github.com/Oneflow-Inc/oneflow/pull/10408, https://github.com/Oneflow-Inc/oneflow/pull/9984, https://github.com/Oneflow-Inc/oneflow/pull/9754)
Interface Signature and Parameter Introduction:
compile_from_torch(torch_module: torch.nn.Module, *, use_graph=True, options={})
* torch_module: The Torch Module instance to be converted.
* use_graph: Indicates whether to transform into a static graph nn.Graph and utilize MLIR compilation acceleration. The default is True.
* options:
* size: When using the static graph nn.Graph, a hash of the graph corresponding to the input shape is computed and cached. size indicates the maximum capacity of the static graph cache; when the maximum capacity is exceeded, graphs are evicted according to an LRU strategy. The default value is 9.
* dynamic: The first input with a dynamic shape triggers a full graph compilation. For subsequent inputs with different shapes, if dynamic is True, a shared graph is reused to accelerate compilation; if dynamic is False, each new shape triggers a full compilation. The default is True.
* debug: Debug mode and log level settings. -1 disables debug mode, 0 outputs warnings and static graph construction information, 1 additionally outputs graph construction information for each sub-module, 2 additionally outputs progress for each operator, 3 provides more detailed operator information. The default value is -1.
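The size option's LRU eviction can be pictured with a minimal plain-Python sketch; `GraphCache` and `compile_fn` below are illustrative names, not part of the OneFlow API:

```python
from collections import OrderedDict

class GraphCache:
    """Illustrative LRU cache keyed by input shape, mirroring the
    `size` option described above (hypothetical, not OneFlow API)."""
    def __init__(self, size=9):
        self.size = size
        self._cache = OrderedDict()

    def get_or_compile(self, shape, compile_fn):
        key = hash(shape)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        graph = compile_fn(shape)         # full compilation on a cache miss
        self._cache[key] = graph
        if len(self._cache) > self.size:
            self._cache.popitem(last=False)  # evict least recently used
        return graph

cache = GraphCache(size=2)
cache.get_or_compile((1, 3, 224, 224), lambda s: f"graph{s}")
cache.get_or_compile((4, 3, 224, 224), lambda s: f"graph{s}")
cache.get_or_compile((8, 3, 224, 224), lambda s: f"graph{s}")  # evicts the first entry
print(len(cache._cache))  # → 2
```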
Example of Usage:
import torch
from torchvision import models
import oneflow
from oneflow.framework.infer_compiler import compile_from_torch
DEVICE = torch.device("cuda")
WEIGHT = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=WEIGHT).to(DEVICE)
compile_model = compile_from_torch(model, options={"dynamic": True})
The static graph distributed physical execution plan supports separate compilation, allowing each process to independently compile its required execution plan, thereby preventing linear growth of compilation time with GPU scale. The separate compilation feature supports 3D hybrid parallelism (data parallelism + model parallelism + pipeline parallelism) and can be used together with LiBai (the open-source large-scale model training toolbox). To enable this feature, use the command: export ONEFLOW_ENABLE_LAZY_SEPARATE_COMPILE=1. (https://github.com/Oneflow-Inc/oneflow/pull/9920, https://github.com/Oneflow-Inc/oneflow/pull/10140, https://github.com/Oneflow-Inc/oneflow/pull/10141, https://github.com/Oneflow-Inc/oneflow/pull/10124, https://github.com/Oneflow-Inc/oneflow/pull/10102)
Below are the test results for the GPT2 model with LiBai on 128 A100-PCIE-40GB GPUs:
Parallelism | Separate Compilation Enabled | Execution Plan Compilation Time
---|---|---
Data Parallelism (DP128 MP1 PP1) | No | Over 20 minutes
Data Parallelism (DP128 MP1 PP1) | Yes | 108.21 s
3D Parallelism (DP4 MP4 PP8) | No | 445.16 s
3D Parallelism (DP4 MP4 PP8) | Yes | 82.88 s
A series of functional automatic differentiation-related interfaces have been introduced, including jvp, vjp, hvp, vhp, jacobian, and hessian. (https://github.com/Oneflow-Inc/oneflow/pull/10412, https://github.com/Oneflow-Inc/oneflow/pull/10428)
Example of Usage:
import oneflow as flow

# jacobian example
def exp_reducer(x):
    return x.exp().sum(dim=1)

input = flow.rand(2, 2)
jac_rslt = flow.autograd.functional.jacobian(exp_reducer, input)

# vhp example
def pow_reducer(x):
    return x.pow(3).sum()

input = flow.rand(2, 2)
v = flow.ones(2, 2)
vhp_rslt = flow.autograd.functional.vhp(pow_reducer, input, v)
Introduced a new Insight module, enabling visualization of kernel invocation, execution time, speed, and other information within instrumented (tracepoint) intervals. (https://github.com/Oneflow-Inc/oneflow/pull/10370)
Usage:
For more detailed information, please refer to: https://github.com/Oneflow-Inc/oneflow/tree/master/python/oneflow/utils/insight#usage
LiBai (the open-source toolbox for large-scale model training) has been upgraded to version v0.3.0. It now natively supports fine-tuning and distributed inference of the large language models Llama2 and ChatGLM2, covering full fine-tuning, adapter fine-tuning, and LoRA fine-tuning; lm-eval-harness can be used for language model evaluation and validation.
The distributed training and inference support for ChatGLM and Llama2 is as follows:
Example of Usage:
# full finetune
bash tools/train.sh projects/Llama/train_net.py projects/Llama/configs/llama_sft.py 8
# adapter finetune
bash tools/train.sh projects/Llama/adapter/train_net.py projects/Llama/adapter/adapter_sft.py 8
# inference
bash tools/infer.sh projects/Llama/pipeline.py 8
# eval
python projects/Llama/utils/eval_adapter.py
Added FFT-related operators. (https://github.com/Oneflow-Inc/oneflow/pull/10027)
Added the `zeta` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10189)
Added the `tril_` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9996)
Added the `clone` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9800)
Added the `frac` and `frac_` operators. (https://github.com/Oneflow-Inc/oneflow/pull/9979)
Added the `exp2` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9958)
Added the `rrelu` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9736)
Added the `lgamma` backward operator. (https://github.com/Oneflow-Inc/oneflow/pull/10177)
Added the `digamma` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10066)
Added the `trigamma` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10117)
Added the `bitwise_not` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9859)
Added the `squared_relu` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10316)
Added the `skip_rms_norm` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10036)
Added `multi_tensor_amp_grad_scaler`-related operators. (https://github.com/Oneflow-Inc/oneflow/pull/10071)
Added the `bitwise_and`, `bitwise_or`, and `bitwise_xor` operators. (https://github.com/Oneflow-Inc/oneflow/pull/9842)
Added the `fused_attention_concat_past_key_value` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9963)
Added the `fused_multi_head_attention_inference_v2` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9933)
Added the `fused_codegeex_qkv_reshape` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9927)
Added the `fused_apply_rotary_emb` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9914)
Added the `skip_layer_norm` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9906)
Added the `groupwise_dequantize` and `fused_linear_with_groupwise_quantized_weight` operators. (https://github.com/Oneflow-Inc/oneflow/pull/9900)
Added the `fused_scale_mask_bias_softmax` and `fused_scale_mask_bias_softmax_grad` operators. (https://github.com/Oneflow-Inc/oneflow/pull/9867)
Added the `depend` operator for describing dependency relationships in the computation graph. (https://github.com/Oneflow-Inc/oneflow/pull/9807)
Added operators for handling complex data types: `real`, `imag`, `conj`, and `conj_physical`. (https://github.com/Oneflow-Inc/oneflow/pull/10034, https://github.com/Oneflow-Inc/oneflow/pull/10281)
Added CPU support for the `nms` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10225)
Added support for the `cast` operator to convert the `bool` data type to `int16`. (https://github.com/Oneflow-Inc/oneflow/pull/10211)
Added support for the `arange` operator for the `fp16` data type. (https://github.com/Oneflow-Inc/oneflow/pull/10019)
Added support for the `adaptive_avg_pool` operator for the `fp16` data type. (https://github.com/Oneflow-Inc/oneflow/pull/10004)
Added support for the `nonzero` operator for the `fp16` data type. (https://github.com/Oneflow-Inc/oneflow/pull/9826)
Added support for the `exponential` operator for the `half` data type. (https://github.com/Oneflow-Inc/oneflow/pull/10005)
Added support for the `arg_sort` and `top_k` operators for the `half` data type. (https://github.com/Oneflow-Inc/oneflow/pull/10000)
Added support for basic operators such as `add`, `sub`, `mul`, `mm`, `sqrt`, and `div` for complex data types. (https://github.com/Oneflow-Inc/oneflow/pull/10269, https://github.com/Oneflow-Inc/oneflow/pull/10136, https://github.com/Oneflow-Inc/oneflow/pull/10284, https://github.com/Oneflow-Inc/oneflow/pull/10049)
Added support for basic binary operators with non-contiguous input tensors. (https://github.com/Oneflow-Inc/oneflow/pull/9986)
Added a virtual `jit` interface to support mocking torch for user code that imports the interface but does not actually use it. (https://github.com/Oneflow-Inc/oneflow/pull/10395)
Added the `mem_get_info` interface to return total and free memory for a specified CUDA device. (https://github.com/Oneflow-Inc/oneflow/pull/10398)
Added the `tensor.new` interface. (https://github.com/Oneflow-Inc/oneflow/pull/9881)
Added the `tensor.is_cpu` interface. (https://github.com/Oneflow-Inc/oneflow/pull/10172)
Added the `tensor.is_view` interface. (https://github.com/Oneflow-Inc/oneflow/pull/10101)
Added the `tensor.data_ptr` interface. (https://github.com/Oneflow-Inc/oneflow/pull/10111, https://github.com/Oneflow-Inc/oneflow/pull/10139)
Added the `tensor.baddbmm` interface. (https://github.com/Oneflow-Inc/oneflow/pull/9918)
Added interfaces such as `special.erf` and `special.erfc`. (https://github.com/Oneflow-Inc/oneflow/pull/9982)
Added the `layout` and `frombuffer` interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/10171)
Added prune-related interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/9730)
Added the `utils.model_zoo` interface. (https://github.com/Oneflow-Inc/oneflow/pull/10183)
Added the `get_rng_state` and `get_rng_state_all` interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/9760)
Added the `set_rng_state` and `set_rng_state_all` interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/10250)
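These interfaces follow the usual capture-and-restore RNG-state pattern. A minimal sketch using Python's stdlib `random` module as a stand-in for OneFlow's generators:

```python
import random

# Capture the generator state, draw a sample, then restore the state and
# draw again: the restored state reproduces the same sample. The added
# get_rng_state/set_rng_state interfaces follow this same shape for
# OneFlow's generators (stdlib random is used here only as an analogy).
state = random.getstate()      # analogous to get_rng_state()
first = random.random()
random.setstate(state)         # analogous to set_rng_state(state)
second = random.random()
print(first == second)  # → True
```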
Added support for the `float16` data type. (https://github.com/Oneflow-Inc/oneflow/pull/9697)
Added support for the `char` and `short` data types. (https://github.com/Oneflow-Inc/oneflow/pull/10086)
Added support for the `complex64` and `complex128` data types. (https://github.com/Oneflow-Inc/oneflow/pull/9987)
Integrated Transform Dialect into the MLIR codegen process. (https://github.com/Oneflow-Inc/oneflow/pull/10224, https://github.com/Oneflow-Inc/oneflow/pull/10227)
Added code generation support for the `matmul` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10283)
Added code generation support for the `softmax` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10263, https://github.com/Oneflow-Inc/oneflow/pull/10272)
Added code generation support for the `transform.oneflow.apply_patterns` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10255)
Added support for `byte` attributes in the MLIR codegen process. (https://github.com/Oneflow-Inc/oneflow/pull/10276)
Added `extra_libs` functionality to the `mock_torch` module, enabling flowvision to mimic torchvision's functionality. (https://github.com/Oneflow-Inc/oneflow/pull/10223)
Added a `lazy` parameter to the `mock_torch` module, allowing non-existent interfaces to return a fake object instead of raising an immediate error. (https://github.com/Oneflow-Inc/oneflow/pull/9876)
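The lazy behavior can be pictured with a minimal placeholder-object sketch; the `DummyObject` class below is illustrative, not OneFlow's actual implementation:

```python
class DummyObject:
    """Lazily-failing placeholder: attribute access returns another
    placeholder, and an error is raised only when the object is called."""
    def __init__(self, path="torch"):
        self._path = path
    def __getattr__(self, name):
        # No error here: just record the access path and keep going.
        return DummyObject(f"{self._path}.{name}")
    def __call__(self, *args, **kwargs):
        raise NotImplementedError(f"{self._path} is not implemented in the mock")

torch = DummyObject()
attr = torch.some.missing.api   # importing/touching the name does not fail
print(attr._path)               # → torch.some.missing.api
```

Only actually calling the placeholder raises, so code that merely imports an unsupported interface keeps running.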
Added `skip_init` functionality and introduced the meta device. (https://github.com/Oneflow-Inc/oneflow/pull/10008)
Introduced the HostMemoryInput mechanism, allowing an operator's specific input to be defined as HostMemoryInput type for accessing data within the kernel's host function body. (https://github.com/Oneflow-Inc/oneflow/pull/9928)
Added fusion mechanism for nccl logical operations to reduce excessive synchronization overhead in scenarios like ZERO, where too many fragmented nccl calls lead to significant training speed reduction. (https://github.com/Oneflow-Inc/oneflow/pull/9879)
Introduced a mechanism for re-computation of tensor operations. (https://github.com/Oneflow-Inc/oneflow/pull/9861)
Added support for `backward_hook`, `register_full_backward_hook`, and `register_state_dict_pre_hook`. (https://github.com/Oneflow-Inc/oneflow/pull/9837, https://github.com/Oneflow-Inc/oneflow/pull/9710)
Added content related to the stochastic weight averaging algorithm to the optimizers module. (https://github.com/Oneflow-Inc/oneflow/pull/9781)
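As background, stochastic weight averaging keeps a running mean of the weights collected at periodic checkpoints; a plain-Python sketch of the averaging step (not the OneFlow optimizer API):

```python
def swa_average(checkpoints):
    """Return the element-wise running mean of a sequence of weight lists,
    as SWA does across collected checkpoints (illustrative sketch)."""
    n = 0
    avg = None
    for weights in checkpoints:
        n += 1
        if avg is None:
            avg = list(weights)
        else:
            # incremental running mean: avg += (w - avg) / n
            avg = [a + (w - a) / n for a, w in zip(avg, weights)]
    return avg

print(swa_average([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]))  # → [2.0, 4.0]
```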
Added graph-level flattening algorithm. (https://github.com/Oneflow-Inc/oneflow/pull/9718, https://github.com/Oneflow-Inc/oneflow/pull/9748)
Added DelayVariableOpExecutionPass optimization pass for the computation graph. (https://github.com/Oneflow-Inc/oneflow/pull/9745)
Added the `MulCastPattern` operator fusion rule. (https://github.com/Oneflow-Inc/oneflow/pull/9715)
Added the environment variable `ONEFLOW_ENABLE_GLOBAL_INPUTS_WITH_INCONSISTENT_PLACEMENT` to control whether global tensors used by operators are automatically placed on the largest rank via the `to_global` operation. (https://github.com/Oneflow-Inc/oneflow/pull/10073)
Added the environment variable `ONEFLOW_EAGER_NCCL_USE_COMPUTE_STREAM` to control whether `nccl` and regular computations in eager mode run on the same stream. The default value is `false`. (https://github.com/Oneflow-Inc/oneflow/pull/10230)
Added the environment variable `VLOG_REMAT` to handle dynamic graph recomputation logs and interface with `ComputeComplexityFn` to estimate op computation time. (https://github.com/Oneflow-Inc/oneflow/pull/10212)
Added the environment variable `ENABLE_ACTOR_DEBUG_LOG` to print detailed logs of actor message sending, receiving, and execution on the current rank. (https://github.com/Oneflow-Inc/oneflow/pull/10081)
Added the environment variable `ONEFLOW_RUN_GRAPH_BY_VM` to control whether the VM is used to run the static graph nn.Graph. (https://github.com/Oneflow-Inc/oneflow/pull/9884)
Added the environment variable `ONEFLOW_DISABLE_MOCK_TORCH` to control whether to disable the `mock_torch` functionality. (https://github.com/Oneflow-Inc/oneflow/pull/9805)
Added the environment variable `ONEFLOW_VM_MULTI_THREAD` to control the number of threads used in the VM. (https://github.com/Oneflow-Inc/oneflow/pull/9698)
Added support for the second-order optimizer `lbfgs`. (https://github.com/Oneflow-Inc/oneflow/pull/10265)
A series of optimizations and refactoring has been implemented for the Eager runtime, including:
Unified system memory pool to manage memory resources across all allocators on the same device. (https://github.com/Oneflow-Inc/oneflow/pull/8591)
Integration with CUDA native interfaces to accelerate kernel launches. (https://github.com/Oneflow-Inc/oneflow/pull/8571)
Optimization of the instruction scheduling mechanism to reduce system overhead. (https://github.com/Oneflow-Inc/oneflow/pull/8796)
Introduction of an instruction fusion mechanism to accelerate instruction dispatch. (https://github.com/Oneflow-Inc/oneflow/pull/7399)
Speed improvement in Autograd graph construction. (https://github.com/Oneflow-Inc/oneflow/pull/8606)
Optimization of op deduction process to accelerate kernel execution. (https://github.com/Oneflow-Inc/oneflow/pull/8672, https://github.com/Oneflow-Inc/oneflow/pull/8619, https://github.com/Oneflow-Inc/oneflow/pull/8662)
Consolidation of redundant concepts within the eager runtime, decoupling Instruction and Stream. (https://github.com/Oneflow-Inc/oneflow/pull/8583, https://github.com/Oneflow-Inc/oneflow/pull/8590, https://github.com/Oneflow-Inc/oneflow/pull/7607)
Users can configure the Eager runtime using various environment variables:
Environment Variable | Meaning | Default Value
---|---|---
ONEFLOW_VM_COMPUTE_ON_WORKER_THREAD | Whether to perform computation on worker threads | true
ONEFLOW_VM_MULTI_THREAD | Whether to use multi-threaded collaboration for Eager computation | true
ONEFLOW_VM_ENABLE_STREAM_WAIT | Whether to use the stream_wait mechanism for dependencies between multiple streams | true
ONEFLOW_VM_ENABLE_SCHEDULE_YIELD | Whether to use the yield mechanism to reduce the scheduler thread's busy waiting | true
ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE | Whether to cache operator output metadata during computation | true
ONEFLOW_VM_WORKER_THREAD_LIMIT | Number of worker threads | 16
ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE | Maximum number of VM instructions that may be fused | 10
ONEFLOW_VM_BLOCKING_DEBUG_INSTRUCTIONS_DISPLAY_LIMIT | Number of unprocessed instructions printed when VM execution times out | 1000
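For example, these variables can be set in the shell before launching a script; the values shown are the defaults from the table and should be adjusted per workload:

```shell
# Defaults from the table above, set explicitly for illustration.
export ONEFLOW_VM_COMPUTE_ON_WORKER_THREAD=true
export ONEFLOW_VM_WORKER_THREAD_LIMIT=16
export ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE=10
export ONEFLOW_VM_BLOCKING_DEBUG_INSTRUCTIONS_DISPLAY_LIMIT=1000
```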
OneFlow Serving features have been upgraded to support additional backends, including OneFlow Python backend and OneFlow Lite backend, in addition to the existing support for the OneFlow Cpp backend.
For usage instructions, refer to: https://github.com/Oneflow-Inc/serving/blob/main/README.md
Optimized certain code implementations to accommodate CUDA 12.x. (https://github.com/Oneflow-Inc/oneflow/pull/10367)
Optimized the glu operator implementation to support bias-less inputs. (https://github.com/Oneflow-Inc/oneflow/pull/9874)
Optimized pooling operator implementation to support the channels_last parameter. (https://github.com/Oneflow-Inc/oneflow/pull/10242)
Optimized the flip operator implementation to address memory access inefficiencies when dim = -1. (https://github.com/Oneflow-Inc/oneflow/pull/10310)
Optimized the bincount operator implementation for accelerated performance. (https://github.com/Oneflow-Inc/oneflow/pull/10308)
Optimized the index_add operator implementation by dispatching varied logic based on index length to enhance performance for smaller indices. (https://github.com/Oneflow-Inc/oneflow/pull/9751)
Optimized the topk operator implementation to boost performance when batch size equals 1. (https://github.com/Oneflow-Inc/oneflow/pull/10009)
Optimized implementations of operators such as conv and arange to facilitate CUDA graph usage. (https://github.com/Oneflow-Inc/oneflow/pull/9761)
Optimized the upsample operator implementation to include input/output size validation. (https://github.com/Oneflow-Inc/oneflow/pull/9737)
Optimized the grouped_matmul_bias operator implementation by introducing tensor parallelism sbp derivation rules. (https://github.com/Oneflow-Inc/oneflow/pull/9934)
Optimized the reshape operator implementation with added nd sbp derivation rules. (https://github.com/Oneflow-Inc/oneflow/pull/9858)
Optimized error messages and completed test cases for mask_fill and in_top_k operators. (https://github.com/Oneflow-Inc/oneflow/pull/10062)
Optimized the higher-order differentiation rules for the tanh operator to optimize performance under third-order differentiation. (https://github.com/Oneflow-Inc/oneflow/pull/10188, https://github.com/Oneflow-Inc/oneflow/pull/10237)
Optimized conv interface implementation to support device and dtype parameters. (https://github.com/Oneflow-Inc/oneflow/pull/10228)
Optimized conv interface implementation to automatically expand input dimensions. (https://github.com/Oneflow-Inc/oneflow/pull/9721)
Optimized sum interface implementation to accommodate dtype parameters. (https://github.com/Oneflow-Inc/oneflow/pull/10204)
Optimized softmax interface implementation to support dtype parameters. (https://github.com/Oneflow-Inc/oneflow/pull/10069)
Optimized maxpool interface implementation to support 3D input tensors. (https://github.com/Oneflow-Inc/oneflow/pull/10110)
Optimized ctc_loss interface implementation to align its parameters with the PyTorch interface. (https://github.com/Oneflow-Inc/oneflow/pull/9887)
Optimized copy interface implementation to support scenarios where input and output have different devices and dtypes. (https://github.com/Oneflow-Inc/oneflow/pull/9888)
Optimized grad interface implementation to support the allow_unused parameter. (https://github.com/Oneflow-Inc/oneflow/pull/10251)
Optimized load interface implementation to provide more user-friendly error messages. (https://github.com/Oneflow-Inc/oneflow/pull/10138)
Optimized fused_matmul_bias operator and interface implementation to support alpha and beta parameters. (https://github.com/Oneflow-Inc/oneflow/pull/10015)
Optimized normal operator and interface implementation to align behavior with PyTorch. (https://github.com/Oneflow-Inc/oneflow/pull/10185)
Optimized fused attention operator and interface implementation to allow None for past_key and past_value. (https://github.com/Oneflow-Inc/oneflow/pull/9977)
Optimized fused_attention operator and interface implementation to add support for variable sequence lengths. (https://github.com/Oneflow-Inc/oneflow/pull/9991)
Optimized fused_multi_head_attention_inference operator and interface implementation to include the attn_bias parameter. (https://github.com/Oneflow-Inc/oneflow/pull/9853)
Optimized bn-related functor implementation, merging bn_add_relu and bn_relu operations to expedite inference. (https://github.com/Oneflow-Inc/oneflow/pull/10239)
Optimized MLIR CodeGen-based processes and upgraded LLVM version to 16.0.0. (https://github.com/Oneflow-Inc/oneflow/pull/9985)
Optimized MLIR codegen-based processes by adding AppendOneFlowStream, MgpuToOneFlowStream, and CastOneFlowInputToSignlessPass passes. (https://github.com/Oneflow-Inc/oneflow/pull/10149, https://github.com/Oneflow-Inc/oneflow/pull/10151, https://github.com/Oneflow-Inc/oneflow/pull/10099)
Optimized MLIR codegen-based processes by linking LibDevice to support NVVM IR conversion to cubin. (https://github.com/Oneflow-Inc/oneflow/pull/10200)
Optimized MLIR codegen-based processes by utilizing tmpbuffer as MemPool in MLIR. (https://github.com/Oneflow-Inc/oneflow/pull/10159)
Optimized MLIR codegen-based processes by enabling bufferizable operator dispatch. (https://github.com/Oneflow-Inc/oneflow/pull/9787)
Optimized MLIR codegen-based processes to expedite ofmempool and related processes. (https://github.com/Oneflow-Inc/oneflow/pull/10152, https://github.com/Oneflow-Inc/oneflow/pull/10168, https://github.com/Oneflow-Inc/oneflow/pull/10184, https://github.com/Oneflow-Inc/oneflow/pull/10239)
Optimized stacktrace call stack information. (https://github.com/Oneflow-Inc/oneflow/pull/9912, https://github.com/Oneflow-Inc/oneflow/pull/9937, https://github.com/Oneflow-Inc/oneflow/pull/10260, https://github.com/Oneflow-Inc/oneflow/pull/10161)
Optimized random number generator implementation by adding caching to avoid regeneration with each call. (https://github.com/Oneflow-Inc/oneflow/pull/10387)
Optimized graph load functionality to support loading the graph onto a new device. (https://github.com/Oneflow-Inc/oneflow/pull/10335)
Optimized dummy array initialization implementation using fold expressions. (https://github.com/Oneflow-Inc/oneflow/pull/10271)
Optimized MemoryFormat class organization, exposed to the Python layer via CPython to support changing a tensor's MemoryFormat using the Tensor.to interface. (https://github.com/Oneflow-Inc/oneflow/pull/10181)
Optimized implementations of stream, device, and vm to support more device types. (https://github.com/Oneflow-Inc/oneflow/pull/10166)
Optimized error messages for MapAt, adding printing of key values. (https://github.com/Oneflow-Inc/oneflow/pull/10090)
Optimized OOM error messages to differentiate CUDA and CPU devices and display size. (https://github.com/Oneflow-Inc/oneflow/pull/9938)
Optimized error messages for CHECK_XX_OR_RETURN macros. (https://github.com/Oneflow-Inc/oneflow/pull/9921)
Optimized error messages for graph-related issues. (https://github.com/Oneflow-Inc/oneflow/pull/9821)
Optimized error messages for convolution operator-related issues. (https://github.com/Oneflow-Inc/oneflow/pull/9707)
Optimized model initialization to minimize additional overhead. (https://github.com/Oneflow-Inc/oneflow/pull/10088)
Optimized thread manager implementation to accommodate three usage scenarios: unrestricted threads, master as a thread, and n threads. (https://github.com/Oneflow-Inc/oneflow/pull/10060)
Optimized the numpy array release mechanism to release arrays in the main thread, reducing time-consuming GIL acquisitions. (https://github.com/Oneflow-Inc/oneflow/pull/10050)
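The idea can be pictured with a minimal hand-off sketch (names are illustrative, not OneFlow internals): worker code enqueues objects instead of dropping the last reference itself, and the main thread drains the queue so the destructors run there:

```python
import queue

# Hand-off queue: workers push objects they are done with; the main
# thread drains it, so the last reference is dropped (and any costly
# destructor work happens) on the main thread.
release_queue = queue.Queue()

def worker_done_with(obj):
    release_queue.put(obj)  # defer the release instead of doing it here

def drain_release_queue():
    released = 0
    while not release_queue.empty():
        release_queue.get()  # last reference dropped on this thread
        released += 1
    return released

worker_done_with([1, 2, 3])
worker_done_with([4, 5])
print(drain_release_queue())  # → 2
```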
Optimized graph save runtime_state_dict implementation to enhance performance and address related issues. (https://github.com/Oneflow-Inc/oneflow/pull/10016)
Optimized parsing of different calling methods for interfaces like Tensor.foo(*args) using a unified PyParseArgs function. (https://github.com/Oneflow-Inc/oneflow/pull/9983)
Optimized the implementation of the ArgsTree class to support arbitrary output types and conducted file location migration. (https://github.com/Oneflow-Inc/oneflow/pull/9846)
Optimized memory allocation mechanism to achieve ordered allocation based on streams. (https://github.com/Oneflow-Inc/oneflow/pull/9818)
Removed deallocate context. (https://github.com/Oneflow-Inc/oneflow/pull/10143)
Removed debug compilation mode in graph compilation. (https://github.com/Oneflow-Inc/oneflow/pull/10145)
Removed unused logic for MemChain merge. (https://github.com/Oneflow-Inc/oneflow/pull/10097)
Removed default settings for some unused distributed environment variables. (https://github.com/Oneflow-Inc/oneflow/pull/9803)
Refactored collective boxing implementation under lazy mode. (https://github.com/Oneflow-Inc/oneflow/pull/10098)
Refactored registration of EagerCclS2S. (https://github.com/Oneflow-Inc/oneflow/pull/10100)
Refactored implementation of collective_boxing_executor_backend. (https://github.com/Oneflow-Inc/oneflow/pull/10082)
Refactored implementation of running global nn.Graph using VM. (https://github.com/Oneflow-Inc/oneflow/pull/10048)
Refactored implementation of local-to-global related interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/9870)
Refactored operator dispatch dialect implementation in MLIR codegen process. (https://github.com/Oneflow-Inc/oneflow/pull/9693)
Refactored implementation of random generator and distribution kernels. (https://github.com/Oneflow-Inc/oneflow/pull/9691)
Refactored implementation of fast_atomic_add operator. (https://github.com/Oneflow-Inc/oneflow/pull/9680)
Refactored error check related macros in glog. (https://github.com/Oneflow-Inc/oneflow/pull/10176)
Refactored implementation of random generator. (https://github.com/Oneflow-Inc/oneflow/pull/10025)
Refactored implementation of some elementwise primitive operations. (https://github.com/Oneflow-Inc/oneflow/pull/9857)
Refactored code related to device descriptions. (https://github.com/Oneflow-Inc/oneflow/pull/9791)
Refactored implementation of ParseDeviceString and ParseDeviceNameConf. (https://github.com/Oneflow-Inc/oneflow/pull/9833)
Refactored implementation of ActorMsg related functionalities, introducing IBVerbsActorMsgWrapper wrapper to reduce the size of ActorMsg. (https://github.com/Oneflow-Inc/oneflow/pull/9762)
Refactored implementation of save and load interfaces, migrating the method of saving graphs to the _save_graph function, adding some _open* helper classes to differentiate between paths and memory, enabling saving weights to BytesIO in save, and supporting file streaming in load. (https://github.com/Oneflow-Inc/oneflow/pull/10021)
Refactored implementation of some tensor-related interfaces, migrating code from Python layer to C++ layer. (https://github.com/Oneflow-Inc/oneflow/pull/9990, https://github.com/Oneflow-Inc/oneflow/pull/9964)
Upgraded PyBind version used in the project to 2.11.1. (https://github.com/Oneflow-Inc/oneflow/pull/10391)
Fixed default dynamic linking settings in CMake files to avoid LLVM15 linking errors. (https://github.com/Oneflow-Inc/oneflow/pull/10373, https://github.com/Oneflow-Inc/oneflow/pull/10131)
Fixed cast-related bugs in MLIR codegen. (https://github.com/Oneflow-Inc/oneflow/pull/10105)
Fixed logic handling for cpg attr in Module._apply function. (https://github.com/Oneflow-Inc/oneflow/pull/10343)
Fixed inheritance issue for DummyModule when attr is mro_entries. (https://github.com/Oneflow-Inc/oneflow/pull/9976)
Fixed size checking issue for _handle_size_arg in full op. (https://github.com/Oneflow-Inc/oneflow/pull/9975)
Fixed residual environment variables after launching mock via command line, causing subsequent API mock parameter errors. (https://github.com/Oneflow-Inc/oneflow/pull/9970)
Fixed inability to exit when two processes encounter exceptions. (https://github.com/Oneflow-Inc/oneflow/pull/10054)
Fixed bug in grouped quantization sbp derivation. (https://github.com/Oneflow-Inc/oneflow/pull/10132)
Fixed kMaxInputCount check issue in GroupedMatmulFunctor. (https://github.com/Oneflow-Inc/oneflow/pull/10322)
Fixed 0-size tensor broadcast issue. (https://github.com/Oneflow-Inc/oneflow/pull/10186)
Fixed issue where double type attr was not updated when using shared_graph. (https://github.com/Oneflow-Inc/oneflow/pull/10279)
Fixed data type error in GetItemInScalarTensor. (https://github.com/Oneflow-Inc/oneflow/pull/10226)
Fixed gradient issue in GroupNorm, calling GroupNormParamGrad only when gamma and beta gradients are required. (https://github.com/Oneflow-Inc/oneflow/pull/10045)
Fixed error when reading tensors with partial ranks in global mode. (https://github.com/Oneflow-Inc/oneflow/pull/10056)
Fixed control boundary issues in checkpointing under PP, affecting task graph construction under separate compilation. (https://github.com/Oneflow-Inc/oneflow/pull/10057)
Fixed bug when using 3D parallelism and enabling activation checkpointing simultaneously. (https://github.com/Oneflow-Inc/oneflow/pull/10031)
Fixed adaptation bug of AutoMixedPrecision pass on non-CUDA devices and bug related to device combinations in LayerNorm Module. (https://github.com/Oneflow-Inc/oneflow/pull/10026)
Fixed default value setting issue for reduce parameter in scatter operator. (https://github.com/Oneflow-Inc/oneflow/pull/10002)
Fixed incomplete disable of some Torch variables in mock.disable, causing lingering references in other globals. (https://github.com/Oneflow-Inc/oneflow/pull/9989)
Fixed destructor issue in vm::TensorStorage. (https://github.com/Oneflow-Inc/oneflow/pull/9962)
Fixed offload issue where small tensors were not released from CUDA memory. (https://github.com/Oneflow-Inc/oneflow/pull/9974)
Fixed occasional segmentation fault in the Python stack getter due to thread unsafety. (https://github.com/Oneflow-Inc/oneflow/pull/9955)
Fixed element lookup issue in set under separate compilation scenario. (https://github.com/Oneflow-Inc/oneflow/pull/9952)
Aligned qkv and output_layout in fused_multi_head_attention operator. (https://github.com/Oneflow-Inc/oneflow/pull/9950)
Fixed inconsistency in seed behavior of random series operators between graph and checkpointing. (https://github.com/Oneflow-Inc/oneflow/pull/9941)
Fixed parameter reload failure issue in Eager mode. (https://github.com/Oneflow-Inc/oneflow/pull/9935)
Fixed infinite loop issue in specific cases of mock torch lazy functionality. (https://github.com/Oneflow-Inc/oneflow/pull/9926)
Fixed issue where code in the stft_kernel.cu file was not compiled by default. (https://github.com/Oneflow-Inc/oneflow/pull/9922)
Fixed deadlock and memory allocation errors caused by an invalid topological order in order_in_graph due to an incomplete TaskGraph under separate compilation. (https://github.com/Oneflow-Inc/oneflow/pull/9909)
Fixed xrt compilation issue where fmt could not be found. (https://github.com/Oneflow-Inc/oneflow/pull/9894)
Fixed imbalance in GPU memory allocation among processes during local to global process where sbp is B. (https://github.com/Oneflow-Inc/oneflow/pull/9852)
Aligned OneFlow and PyTorch behaviors related to the third parameter of CTCLoss. (https://github.com/Oneflow-Inc/oneflow/pull/9845)
Fixed initialization issues related to thread_global_id and rank_group_scope. (https://github.com/Oneflow-Inc/oneflow/pull/9841)
Fixed inplace handling errors in dropout operator implementation. (https://github.com/Oneflow-Inc/oneflow/pull/9808)
Fixed errors in loading non-tensor objects saved by PyTorch in the load function. (https://github.com/Oneflow-Inc/oneflow/pull/9804)
Fixed conflicts between contiguous memory and GPU memory allocation strategies. (https://github.com/Oneflow-Inc/oneflow/pull/9786)
Fixed memory allocation issues in EagerBlobObject::ByteSizeOfBlobBody when considering non-contiguous cases. (https://github.com/Oneflow-Inc/oneflow/pull/9782)
Fixed dtype inference errors in fill_ operator during autocast. (https://github.com/Oneflow-Inc/oneflow/pull/9776)
Fixed sbp derivation rule issues in fused_glu operator. (https://github.com/Oneflow-Inc/oneflow/pull/10108)
Fixed issues related to calling nn.Graph.__map_io. (https://github.com/Oneflow-Inc/oneflow/pull/10084)
Fixed inconsistency between set_grad_mode interface and PyTorch behavior. (https://github.com/Oneflow-Inc/oneflow/pull/10059)
Fixed an issue related to the map_location parameter in the load interface and added support for passing lambda functions. (https://github.com/Oneflow-Inc/oneflow/pull/10052)
Fixed stride inference errors after unsqueeze operation in view mode. (https://github.com/Oneflow-Inc/oneflow/pull/9775)
Fixed problems in conv op with unbatched input and bias, and added support for unbatched input in deconv op. (https://github.com/Oneflow-Inc/oneflow/pull/9740)
Fixed logic errors in trunc_normal_ implementation. (https://github.com/Oneflow-Inc/oneflow/pull/9711)
Fixed default value issue in dim parameter of topk operator. (https://github.com/Oneflow-Inc/oneflow/pull/9703)
Fixed issues where placement of some networks was incorrectly set to CPU during static graph printing. (https://github.com/Oneflow-Inc/oneflow/pull/9770)
Fixed conflict between include paths of trt_flash_attention and native flash attention. (https://github.com/Oneflow-Inc/oneflow/pull/9750)
Fixed segmentation fault caused by is_shutting_down and gil in stack getter. (https://github.com/Oneflow-Inc/oneflow/pull/9681)
Fixed issues related to the separate compilation feature found in distributed unit testing. (https://github.com/Oneflow-Inc/oneflow/pull/9749)
Fixed memory handling issues in flatten algorithm implementation. (https://github.com/Oneflow-Inc/oneflow/pull/9746)
Fixed a deadlock issue in the execution flow. (https://github.com/Oneflow-Inc/oneflow/pull/9738)
Fixed errors in isinstance check for DummyModule. (https://github.com/Oneflow-Inc/oneflow/pull/10207)
Corrected behavior where default size was erroneously overridden when introducing llvm::SmallVector. (https://github.com/Oneflow-Inc/oneflow/pull/9932)
Fixed errors in calculating memory size of non-contiguous memory tensors. (https://github.com/Oneflow-Inc/oneflow/pull/9819)
Fixed issues with calling CHECK_JUST in the TensorStorage destructor function. (https://github.com/Oneflow-Inc/oneflow/pull/9752)
The backbone parts of the ResNet50 and Faster RCNN models were compiled and executed using the OneFlow compile_from_torch and PyTorch compile interfaces, to test inference performance with inputs of different shapes. The results are shown in the table below:
Model | input shape | PyTorch compile | OneFlow compile_from_torch | dynamic | test timing |
---|---|---|---|---|---|
ResNet50 | (1, 3, 512, 512) | 21.328 s | 3.205 s | False | initial compilation and execution |
ResNet50 | (2, 3, 896, 512) | 14.167 s | 1.523 s | False | continuous compilation and execution |
ResNet50 | (2, 3, 512, 896) | 13.364 s | 1.402 s | False | continuous compilation and execution |
ResNet50 | (3, 3, 896, 896) | 15.056 s | 1.539 s | False | continuous compilation and execution |
ResNet50 | (2, 3, 1024, 896) | 14.167 s | 1.500 s | False | continuous compilation and execution |
ResNet50 | (2, 3, 896, 1024) | 12.891 s | 1.494 s | False | continuous compilation and execution |
ResNet50 | (6, 3, 1024, 1024) | 14.859 s | 1.872 s | False | continuous compilation and execution |
ResNet50 | (1, 3, 512, 512) | 170.446 s | 3.143 s | True | initial compilation and execution |
ResNet50 | (2, 3, 896, 512) | 185.672 s | 0.851 s | True | continuous compilation and execution |
ResNet50 | (2, 3, 512, 896) | 0.089 s | 0.836 s | True | continuous compilation and execution |
ResNet50 | (3, 3, 896, 896) | 0.084 s | 0.980 s | True | continuous compilation and execution |
ResNet50 | (2, 3, 1024, 896) | 0.077 s | 0.942 s | True | continuous compilation and execution |
ResNet50 | (2, 3, 896, 1024) | 0.080 s | 0.931 s | True | continuous compilation and execution |
ResNet50 | (6, 3, 1024, 1024) | 0.084 s | 1.406 s | True | continuous compilation and execution |
Faster RCNN | (1, 3, 512, 512) | 18.224 s | 5.483 s | False | initial compilation and execution |
Faster RCNN | (2, 3, 896, 512) | 9.200 s | 3.011 s | False | continuous compilation and execution |
Faster RCNN | (2, 3, 512, 896) | 9.331 s | 3.025 s | False | continuous compilation and execution |
Faster RCNN | (3, 3, 896, 896) | 9.301 s | 2.854 s | False | continuous compilation and execution |
Faster RCNN | (2, 3, 1024, 896) | 9.290 s | 2.805 s | False | continuous compilation and execution |
Faster RCNN | (2, 3, 896, 1024) | 9.123 s | 2.851 s | False | continuous compilation and execution |
Faster RCNN | (6, 3, 1024, 1024) | 9.377 s | 3.180 s | False | continuous compilation and execution |
Faster RCNN | (1, 3, 512, 512) | 25.444 s | 5.430 s | True | initial compilation and execution |
Faster RCNN | (2, 3, 896, 512) | 25.381 s | 1.899 s | True | continuous compilation and execution |
Faster RCNN | (2, 3, 512, 896) | 0.116 s | 1.886 s | True | continuous compilation and execution |
Faster RCNN | (3, 3, 896, 896) | 1.982 s | 1.793 s | True | continuous compilation and execution |
Faster RCNN | (2, 3, 1024, 896) | 0.114 s | 1.803 s | True | continuous compilation and execution |
Faster RCNN | (2, 3, 896, 1024) | 0.111 s | 1.778 s | True | continuous compilation and execution |
Faster RCNN | (6, 3, 1024, 1024) | 0.143 s | 2.110 s | True | continuous compilation and execution |
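The latencies in the table above are per-shape wall-clock timings of a compile-plus-run call. A minimal timing harness sketch (illustrative only: `dummy_model` is a stand-in, since the real measurements would call `compile_from_torch(model)(input)` or `torch.compile(model)(input)` on a GPU):

```python
import time

def time_call(fn, *args):
    """Return (result, elapsed seconds) for one call -- the per-shape
    latencies in the table are single-call timings like this."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stand-in for a compiled model; a real benchmark would invoke the
# compiled ResNet50 / Faster RCNN backbone with a tensor of this shape.
def dummy_model(shape):
    return sum(shape)

_, first = time_call(dummy_model, (1, 3, 512, 512))   # "initial compilation and execution"
_, later = time_call(dummy_model, (2, 3, 896, 512))   # "continuous compilation and execution"
print(first >= 0.0 and later >= 0.0)
```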
Using the OneFlow compile_from_torch and PyTorch compile interfaces, the unet section of the Stable Diffusion model was compiled and executed to test the inference performance with outputs of different shapes. The results are presented in the table below:
Model | Output shape | PyTorch compile | OneFlow compile_from_torch | dynamic | test timing |
---|---|---|---|---|---|
Stable Diffusion | (2, 512, 512) | 103.701 s | 63.670 s | False | initial compilation and execution |
Stable Diffusion | (1, 512, 768) | 95.137 s | 53.864 s | False | continuous compilation and execution |
Stable Diffusion | (2, 768, 512) | 90.259 s | 55.271 s | False | continuous compilation and execution |
Stable Diffusion | (1, 768, 768) | 90.196 s | 51.590 s | False | continuous compilation and execution |
Stable Diffusion | (2, 512, 512) | 275.660 s | 57.117 s | True | initial compilation and execution |
Stable Diffusion | (1, 512, 768) | 345.774 s | 43.752 s | True | continuous compilation and execution |
Stable Diffusion | (2, 768, 512) | 349.835 s | 47.653 s | True | continuous compilation and execution |
Stable Diffusion | (1, 768, 768) | 7.224 s | 45.720 s | True | continuous compilation and execution |
Stable Diffusion | (2, 512, 512) | 4.088 s | 2.831 s | False | subsequent execution |
Stable Diffusion | (1, 512, 768) | 3.296 s | 2.325 s | False | subsequent execution |
Stable Diffusion | (2, 768, 512) | 5.594 s | 5.157 s | False | subsequent execution |
Stable Diffusion | (1, 768, 768) | 4.713 s | 3.557 s | False | subsequent execution |
Stable Diffusion | (2, 512, 512) | 4.448 s | 2.801 s | True | subsequent execution |
Stable Diffusion | (1, 512, 768) | 3.201 s | 2.314 s | True | subsequent execution |
Stable Diffusion | (2, 768, 512) | 6.093 s | 4.166 s | True | subsequent execution |
Stable Diffusion | (1, 768, 768) | 4.920 s | 3.557 s | True | subsequent execution |
Conclusion: The OneFlow compile_from_torch interface generally has shorter compilation times than the PyTorch compile interface. Additionally, thanks to OneFlow's extensive operator optimizations, it delivers better execution performance on the Stable Diffusion model.
Note: The tests were conducted on an RTX 3090 GPU with PyTorch v2.1.2 and CUDA 12.2.
Model | GPU model | number of GPUs | macro batch | PyTorch performance(iter/s) | OneFlow performance(iter/s) | speedup ratio |
---|---|---|---|---|---|---|
ResNet50 | 3090 | 1 | 1 | 31.37 | 38.81 | 23.72% |
ResNet50 | 3090 | 1 | 2 | 32.06 | 48.45 | 51.12% |
ResNet50 | 3090 | 2 | 1 | 31.10 | 33.46 | 7.59% |
ResNet50 | 3090 | 2 | 2 | 31.76 | 34.83 | 9.67% |
ResNet50 | A100 | 1 | 1 | 24.60 | 46.64 | 89.59% |
ResNet50 | A100 | 1 | 2 | 25.06 | 49.88 | 99.04% |
ResNet50 | A100 | 2 | 1 | 25.28 | 39.18 | 54.98% |
ResNet50 | A100 | 2 | 2 | 24.09 | 32.84 | 36.32% |
Bert | 3090 | 1 | 1 | 8.93 | 10.41 | 16.57% |
Bert | 3090 | 1 | 2 | 13.11 | 14.31 | 9.15% |
Bert | 3090 | 2 | 1 | 6.94 | 8.27 | 19.16% |
Bert | 3090 | 2 | 2 | 12.19 | 15.58 | 27.81% |
Bert | A100 | 1 | 1 | 10.45 | 12.72 | 21.72% |
Bert | A100 | 1 | 2 | 20.24 | 21.57 | 6.57% |
Bert | A100 | 2 | 1 | 12.63 | 16.09 | 27.39% |
Bert | A100 | 2 | 2 | 24.86 | 29.84 | 20.03% |
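The speedup-ratio column can be reproduced from the two throughput columns as OneFlow iter/s divided by PyTorch iter/s, minus one. A quick consistency check using a few rows from the table above:

```python
# (model, GPU, PyTorch iter/s, OneFlow iter/s, reported speedup %), from the table
rows = [
    ("ResNet50", "3090", 31.37, 38.81, 23.72),
    ("ResNet50", "A100", 25.06, 49.88, 99.04),
    ("Bert", "A100", 24.86, 29.84, 20.03),
]
for model, gpu, pt, of, expected in rows:
    speedup = (of / pt - 1.0) * 100.0  # percent improvement over PyTorch
    assert abs(speedup - expected) < 0.01, (model, gpu, speedup)
print("speedup ratios consistent")
```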
Conclusion: Compared to PyTorch Eager, OneFlow Eager shows significant performance advantages in small-batch scenarios on both the ResNet50 and BERT models.
Note: The tests were conducted with PyTorch v2.1.0 and CUDA 12.1.
compile_from_torch
This interface converts a PyTorch Module instance into a OneFlow Module instance while sharing parameter memory, and supports direct Eager execution or conversion into a static graph nn.Graph with further MLIR compilation acceleration. (https://github.com/Oneflow-Inc/oneflow/pull/10404, https://github.com/Oneflow-Inc/oneflow/pull/10408, https://github.com/Oneflow-Inc/oneflow/pull/9984, https://github.com/Oneflow-Inc/oneflow/pull/9754)
Interface signature and parameters:
compile_from_torch(torch_module: torch.nn.Module, *, use_graph=True, options={})
* torch_module: the Torch Module instance to be converted.
* use_graph: whether to convert into a static graph nn.Graph and accelerate it with MLIR compilation; defaults to True.
* options:
* size: when static graph nn.Graph is used, a hash is computed from the input shape and the corresponding graph is cached; size is the maximum capacity of the graph cache, and graphs beyond it are evicted by an LRU policy. Defaults to 9.
* dynamic: for dynamic-shape inputs, the graph is fully compiled on the first input; for subsequent inputs of different shapes, a shared graph is used to accelerate compilation when dynamic is True, while every new shape is recompiled from scratch when dynamic is False. Defaults to True.
* debug: debug mode and log level. -1 disables debug mode; 0 prints warnings and static-graph build information; 1 additionally prints build information for each submodule; 2 additionally prints progress for each operator; 3 prints more detailed operator information. Defaults to -1.
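The size/LRU behavior described above can be sketched as follows. This is an illustrative stand-in, not OneFlow's actual cache implementation; `GraphCache` and `compile_fn` are hypothetical names:

```python
from collections import OrderedDict

class GraphCache:
    """LRU cache keyed by input shape, mirroring the `size` option:
    beyond `size` entries, the least recently used graph is evicted."""
    def __init__(self, size=9):
        self.size = size
        self.cache = OrderedDict()

    def get_graph(self, shape, compile_fn):
        key = hash(shape)
        if key in self.cache:
            self.cache.move_to_end(key)  # cache hit: mark as recently used
        else:
            self.cache[key] = compile_fn(shape)  # cache miss: compile
            if len(self.cache) > self.size:
                self.cache.popitem(last=False)   # evict the LRU entry
        return self.cache[key]

cache = GraphCache(size=2)
compiled = []
compile_fn = lambda shape: compiled.append(shape) or shape  # records each compilation
cache.get_graph((1, 3, 512, 512), compile_fn)
cache.get_graph((2, 3, 896, 512), compile_fn)
cache.get_graph((1, 3, 512, 512), compile_fn)  # hit, no recompile
cache.get_graph((3, 3, 896, 896), compile_fn)  # evicts (2, 3, 896, 512)
print(len(compiled))  # 3 compilations for 4 calls
```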
Example:
import torch
from torchvision import models
import oneflow
from oneflow.framework.infer_compiler import compile_from_torch

DEVICE = torch.device("cuda")
WEIGHT = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=WEIGHT).to(DEVICE)
compile_model = compile_from_torch(model, options={"dynamic": True})
The static graph distributed physical execution plan supports separate compilation: each process independently compiles the execution plan it needs, so compilation time no longer grows linearly with the number of GPUs. Separate compilation supports 3D hybrid parallelism (data parallelism + model parallelism + pipeline parallelism) and can be used together with LiBai, the open-source toolbox for large-scale model training. Enable it with: export ONEFLOW_ENABLE_LAZY_SEPARATE_COMPILE=1
. (https://github.com/Oneflow-Inc/oneflow/pull/9920, https://github.com/Oneflow-Inc/oneflow/pull/10140, https://github.com/Oneflow-Inc/oneflow/pull/10141, https://github.com/Oneflow-Inc/oneflow/pull/10124, https://github.com/Oneflow-Inc/oneflow/pull/10102)
Below are test results with LiBai on the GPT2 model, on 128 A100-PCIE-40GB GPUs:
Parallelism | Separate compilation | Execution plan compile time |
---|---|---|
Data parallelism (DP128 MP1 PP1) | No | more than 20 minutes |
Data parallelism (DP128 MP1 PP1) | Yes | 108.21 s |
3D parallelism (DP4 MP4 PP8) | No | 445.16 s |
3D parallelism (DP4 MP4 PP8) | Yes | 82.88 s |
Added a series of functional automatic differentiation interfaces, including jvp, vjp, hvp, vhp, jacobian, and hessian. (https://github.com/Oneflow-Inc/oneflow/pull/10412, https://github.com/Oneflow-Inc/oneflow/pull/10428)
Example:
import oneflow as flow

# jacobian example
def exp_reducer(x):
    return x.exp().sum(dim=1)

input = flow.rand(2, 2)
jac_rslt = flow.autograd.functional.jacobian(exp_reducer, input)

# vhp example
def pow_reducer(x):
    return x.pow(3).sum()

input = flow.rand(2, 2)
v = flow.ones(2, 2)
vhp_rslt = flow.autograd.functional.vhp(pow_reducer, input, v)
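As a sanity check on the jacobian example: for exp_reducer, out_i = sum_k exp(x[i][k]), so d out_i / d x[j][k] equals exp(x[j][k]) when i == j and 0 otherwise. A plain-Python finite-difference sketch (independent of OneFlow, operating on nested lists) confirming this for a fixed 2x2 input:

```python
import math

def exp_reducer(x):  # same function as above, on nested lists
    return [sum(math.exp(v) for v in row) for row in x]

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian: jac[i][j][k] = d f(x)[i] / d x[j][k]."""
    base = f(x)
    jac = [[[0.0] * len(x[0]) for _ in x] for _ in base]
    for j, row in enumerate(x):
        for k in range(len(row)):
            bumped = [r[:] for r in x]  # copy, then perturb one entry
            bumped[j][k] += eps
            out = f(bumped)
            for i in range(len(base)):
                jac[i][j][k] = (out[i] - base[i]) / eps
    return jac

x = [[0.1, 0.2], [0.3, 0.4]]
jac = jacobian_fd(exp_reducer, x)
# Analytically, jac[i][j][k] == exp(x[j][k]) when i == j, else 0.
assert abs(jac[0][0][1] - math.exp(0.2)) < 1e-4
assert abs(jac[1][0][1]) < 1e-4
print("finite-difference Jacobian matches the analytic form")
```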
Added the Insight module, which visualizes kernel invocations, execution time, speed, and other information within the instrumented intervals. (https://github.com/Oneflow-Inc/oneflow/pull/10370)
For usage details, see: https://github.com/Oneflow-Inc/oneflow/tree/master/python/oneflow/utils/insight#usage
LiBai, the open-source toolbox for large-scale model training, has been upgraded to v0.3.0, with native support for fine-tuning and distributed inference of the large language models Llama2 and ChatGLM2. It supports full finetune, adapter finetune, and lora finetune, and lm-eval-harness can be used for language model evaluation and validation.
Distributed training and inference support for ChatGLM and Llama2 is as follows:
Example:
# full finetune
bash tools/train.sh projects/Llama/train_net.py projects/Llama/configs/llama_sft.py 8
# adapter finetune
bash tools/train.sh projects/Llama/adapter/train_net.py projects/Llama/adapter/adapter_sft.py 8
# inference
bash tools/infer.sh projects/Llama/pipeline.py 8
# eval
python projects/Llama/utils/eval_adapter.py
Made a series of optimizations and refactorings to the Eager runtime, mainly including: unification of system memory pools, integration with CUDA native interfaces, optimized instruction scheduling, an instruction fusion mechanism, faster Autograd graph construction, an optimized Op inference process, and decoupling of Instruction and Stream.
The behavior of the Eager runtime can be configured via the following environment variables:
Environment variable | Meaning | Default |
---|---|---|
ONEFLOW_VM_COMPUTE_ON_WORKER_THREAD | whether computation is performed on worker threads | true |
ONEFLOW_VM_MULTI_THREAD | whether Eager operations are executed cooperatively by multiple threads | true |
ONEFLOW_VM_ENABLE_STREAM_WAIT | whether dependencies between multiple streams use the stream_wait mechanism | true |
ONEFLOW_VM_ENABLE_SCHEDULE_YIELD | whether to use the yield mechanism to reduce busy waiting in the scheduler thread | true |
ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE | whether to cache operator output metadata during computation | true |
ONEFLOW_VM_WORKER_THREAD_LIMIT | number of worker threads | 16 |
ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE | maximum size of fused vm instructions | 10 |
ONEFLOW_VM_BLOCKING_DEBUG_INSTRUCTIONS_DISPLAY_LIMIT | number of unprocessed instructions printed when vm execution times out | 1000 |
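An illustrative sketch (not OneFlow internals) of how boolean and integer switches like those in the table are typically read, with the table's defaults applied when a variable is unset:

```python
import os

def env_bool(name, default):
    """Parse a boolean env var; unset falls back to the given default."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.lower() in ("1", "true", "yes", "on")

def env_int(name, default):
    """Parse an integer env var with a default."""
    val = os.environ.get(name)
    return int(val) if val is not None else default

os.environ["ONEFLOW_VM_WORKER_THREAD_LIMIT"] = "8"   # override one default
print(env_bool("ONEFLOW_VM_MULTI_THREAD", True))     # unset -> default True
print(env_int("ONEFLOW_VM_WORKER_THREAD_LIMIT", 16)) # overridden -> 8
```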
OneFlow Serving has been upgraded: in addition to the existing OneFlow Cpp backend, it now supports the OneFlow Python backend and the OneFlow Lite backend.
For usage, see: https://github.com/Oneflow-Inc/serving/blob/main/README.md
Published by jackalcooper almost 2 years ago
OneFlow v0.9.0 came out, welcome to install the new version for a better experience.
This update contains 640 commits and the following highlights:
With the addition of 86 new API interfaces and operators aligned with PyTorch, and fixes for 104 operator-compatibility bugs, OneFlow v0.9.0 provides better PyTorch API and model compatibility. In v0.9.0, users can migrate more PyTorch models to OneFlow with one click and gain faster performance.
Allowing one-click migration of Stable Diffusion, GLM, YOLOv5, etc. to OneFlow.
More convenient model migration: oneflow.load supports directly loading models saved with torch.save.
With the newly added oneflow.mock_torch module and mock method, OneFlow can migrate complex PyTorch models containing multiple scripts with one click, without changing the original PyTorch scripts.
Global Tensor has added a series of interfaces and methods that are convenient for distributed programming, and fixed known related bugs.
The Graph released a new feature of automatic parallelism (version 1), which supports automatic search for the fastest SBP with a specified Placement. When writing distributed models with Global Tensor, users do not need to consider parallelism.
The Graph adds a series of optimizations related to memory, execution speed, pipeline masking, and compilation speed to improve performance and reduces memory overhead.
The Graph provides a series of functions to aid debugging, including analyzing memory logs, displaying the progress during the compilation stage, and the computation graph.
OneFlow IR provides more compilation optimization functions.
OneFlow's error messages are more user-friendly: the erroneous content is highlighted and unnecessary internal details are simplified, so you can see the location and type of an error at a glance.
A series of operator optimizations and system optimizations have been added, including Eager instruction scheduling, high-performance CUDA kernel, opening up of multiple memory pools, etc.
To solve possible name conflicts between Graph.Block.config and user-defined module attributes such as module.config, OneFlow redesigned the abstraction of the Graph proxy Module/Tensor, introducing a breaking change: (https://github.com/Oneflow-Inc/oneflow/pull/9351, https://github.com/Oneflow-Inc/oneflow/pull/9437, https://github.com/Oneflow-Inc/oneflow/pull/9607)
The attr and config attributes on Block are removed, and Block is renamed to Proxy.
Implementation plan: when added as members of nn.Graph, the original Eager Module and Tensor types are wrapped into the Proxy class, and the corresponding GraphModule and GraphTensor are generated. nn.Graph then uses the Proxy for graph composition and proxy execution; from the Proxy, both the original eager type and the graph type can be obtained. The naming follows that of torch.fx.
 | Eager primitive type | Graph type, base class GraphBlock | Proxy execution type, base class Proxy |
---|---|---|---|
Function | Provides access to the original eager type | A GraphBlock stores the information required for graph execution, such as name/scope/lazy op or tensor and the optimization switches of some sub-modules on the graph. | Proxy execution capability, using the same execution interfaces as Module and Tensor, but with changed behavior, e.g. lazy execution, and possibly rewritten ops. |
Module type | Module | GraphModule | ProxyModule contains a Module member and a GraphModule member |
Tensor type | Tensor | GraphTensor | ProxyTensor contains a Tensor member and a GraphTensor member |
import oneflow as flow
import oneflow.nn as nn
from oneflow.nn.graph import GraphModule

linear = flow.nn.Linear(3, 8, False)

class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        # The type of linear is nn.Module. When added as an attribute of nn.Graph, it is registered with nn.Graph:
        # self.linear is wrapped as a ProxyModule, and self.linear.weight as a ProxyTensor.
        # nn.Graph uses the ProxyModule to perform graph composition.
        self.linear = linear
        # A ProxyModule has two parts: the original Module and a GraphModule.
        # self.linear.to(GraphModule) gets the corresponding GraphModule, on which configuration related to
        # graph optimization can be done, e.g. setting a pipeline stage for a module to enable pipeline parallelism:
        self.linear.to(GraphModule).set_stage(id, placement)
        self.linear.to(nn.Module)           # get the corresponding original nn.Module
        self.linear.weight.to(flow.Tensor)  # get the corresponding original Tensor
Outdated interface in OneFlow v0.8.0:
import oneflow as flow
import oneflow.nn as nn

linear = flow.nn.Linear(3, 8, False)

class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = linear
        self.linear.config.set_stage(id, placement)         # set stage
        self.linear.config.activation_checkpointing = True  # set activation checkpointing
        self.linear.origin                                  # get the corresponding original nn.Module
        self.linear.weight.origin                           # get the corresponding original Tensor
New interface in OneFlow v0.9.0:
import oneflow as flow
import oneflow.nn as nn
from oneflow.nn.graph import GraphModule

linear = flow.nn.Linear(3, 8, False)

class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = linear
        self.linear.to(GraphModule).set_stage(id, placement)         # set stage
        self.linear.to(GraphModule).activation_checkpointing = True  # set activation checkpointing
        self.linear.to(nn.Module)                                    # get the corresponding original nn.Module
        self.linear.weight.to(flow.Tensor)                           # get the corresponding original Tensor
Adds the first version of the automatic parallelization feature in Graph: (https://github.com/Oneflow-Inc/oneflow/pull/8891, https://github.com/Oneflow-Inc/oneflow/pull/9172, https://github.com/Oneflow-Inc/oneflow/pull/9288)
Automatic parallelism can be enabled by configuring self.config.enable_auto_parallel(True)
in Graph. Once enabled, you no longer have to configure sbp manually; the Graph automatically searches for the optimal sbp combination.
Here is an example:
import oneflow as flow

class SubclassGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()  # MUST be called
        # auto parallelism configuration
        self.config.enable_auto_parallel(True)
        # other configurations about auto parallelism
        # ......

    def build(self):
        pass
Graph supports a straightened-algorithm optimization with memory priority, which shortens the lifetime of each Tensor by adjusting the execution order, thereby reducing peak memory usage. (https://github.com/Oneflow-Inc/oneflow/pull/9094)
With self.config.enable_straighten_algorithm("MemoryFirst")
, the memory-optimized straightened algorithm can be enabled.
The available modes are: "MemoryFirst" / "SpeedFirst" / "Disable" / "OverlapCpuGpu"
In addition, Graph adds the "OverlapCpuGpu" mode, which makes CPU and GPU kernels overlap with each other as much as possible. (https://github.com/Oneflow-Inc/oneflow/pull/9278)
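The effect of memory-first straightening can be seen on a toy schedule: peak memory depends on execution order, because the order determines how long each tensor stays alive. This is an illustrative sketch only, not OneFlow's actual algorithm:

```python
def peak_memory(order, producers):
    """producers: op -> (bytes its output occupies, set of consumer ops).
    A tensor is freed once its last consumer in `order` has run."""
    live, peak = {}, 0
    for step, op in enumerate(order):
        size, consumers = producers[op]
        last_use = max([order.index(c) for c in consumers], default=step)
        live[op] = (size, last_use)
        current = sum(s for s, end in live.values() if end >= step)
        peak = max(peak, current)
        live = {k: v for k, v in live.items() if v[1] > step}  # free dead tensors
    return peak

# a and b each produce 100 bytes, consumed by c and d respectively
producers = {
    "a": (100, {"c"}), "b": (100, {"d"}),
    "c": (10, set()), "d": (10, set()),
}
naive_order = ["a", "b", "c", "d"]    # both 100-byte tensors alive at once
straightened = ["a", "c", "b", "d"]   # a's output is freed before b runs
print(peak_memory(naive_order, producers), peak_memory(straightened, producers))
```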
Graph provides generalized basic transmission, using nccl send/recv to realize fast communication for any NdSbp (2d, 3d, ...), thus minimizing the transmission volume. (https://github.com/Oneflow-Inc/oneflow/pull/8437, https://github.com/Oneflow-Inc/oneflow/pull/8783)
With autograd.Function, Graph is allowed to use custom op (https://github.com/Oneflow-Inc/oneflow/pull/8843).
The Graph Optimizer supports configuring the learning rate for the parameters of each module/layer through param_group["lr_scale"]
. (https://github.com/Oneflow-Inc/oneflow/pull/9138)
Adds the enable_multi_tensor_update
optimization. Enabled by self.config.enable_multi_tensor_update(True)
, it reduces the overhead of updating a model with many small, fragmented parameters. (https://github.com/Oneflow-Inc/oneflow/pull/9209, https://github.com/Oneflow-Inc/oneflow/pull/9252)
Adds the enable_fused_model_update_cast
optimization. Enabled by self.config.enable_fused_model_update_cast(True)
, it speeds up training by fusing the Optimizer with the fp16 cast when AMP is on. (https://github.com/Oneflow-Inc/oneflow/pull/9209)
Graph supports non-uniform segmentation under ND-SBP. (https://github.com/Oneflow-Inc/oneflow/pull/9310)
Graph supports LazyTensor's indexing feature. (https://github.com/Oneflow-Inc/oneflow/pull/9334)
Adds the enable_compress_memory
interface. Enabled by self.config.enable_compress_memory(True)
, it attempts to compress the memory of the computation graph, iterating over GPU memory plans for up to half an hour and settling on a value close to the lower bound. (https://github.com/Oneflow-Inc/oneflow/pull/9509)
Adds oneflow.utils.global_view.global_mode
, which supports smooth migration from single-GPU code to multi-GPU code. global_mode creates a global context that can be switched on/off, sets the default placement and sbp under that context, and supports LocalTensor syntax such as Tensor.device
and Tensor.to(device)
. Source ops created in this context automatically generate GlobalTensors and populate the default placement and sbp, so the local-tensor logic in a module can be converted to global logic in a non-invasive manner.
Here is an example:
import oneflow as flow
from oneflow.utils.global_view import global_mode

P_C = flow.placement("cpu", ranks=[0, 1])
P = flow.placement("cuda", ranks=[0, 1])
B = flow.sbp.broadcast
S0 = flow.sbp.split(0)
x = flow.ones((6, 8), placement=P_C, sbp=S0)

with global_mode(True, placement=P, sbp=B):
    device = linear_dp.weight.device  # linear_dp: a data-parallel module defined elsewhere
    x = x.to(device)  # global tensor to device
    out = linear_dp(x)
    # The local tensor will be converted to global
    sample = flow.randn(out.shape, device="cpu").to(device)
Provides comprehensive memory analysis logs V2.0. (https://github.com/Oneflow-Inc/oneflow/pull/8565)
Setting the environment variable GLOG_v=3 enables the full memory analysis log in oneflow.INFO.
Adds the shape, dtype, life cycle, and allocation/release order of all tensors in each memory block (Chunk, MemBlock), which helps to quickly determine whether the tensors dominating memory usage in each block are expected.
The Checkpointing pass provides a log recording the tensors selected for Checkpointing.
Adds time_util to record the execution time of each module, actual physical memory occupied, and virtual memory occupied. (https://github.com/Oneflow-Inc/oneflow/pull/9164, https://github.com/Oneflow-Inc/oneflow/pull/9245)
Graph displays a compilation progress bar on rank 0 during graph compilation when debug(0)
is enabled together with the environment variable ONEFLOW_NNGRAPH_ENABLE_PROGRESS_BAR=1
. (https://github.com/Oneflow-Inc/oneflow/pull/9537)
The default log directory is removed (the directory is no longer created and log files are no longer written by default). Log-directory logs are generated only when ONEFLOW_DEBUG_MODE=1
is set. (https://github.com/Oneflow-Inc/oneflow/pull/9552, https://github.com/Oneflow-Inc/oneflow/pull/9575)
Adds the map_location
parameter to oneflow.load
to support specifying the placement or device of the loaded model tensors. (https://github.com/Oneflow-Inc/oneflow/pull/8666)
Adds oneflow.async.thread
, allowing users to create a new thread for asynchronous programming. (https://github.com/Oneflow-Inc/oneflow/pull/8866, https://github.com/Oneflow-Inc/oneflow/pull/9039, https://github.com/Oneflow-Inc/oneflow/pull/9270)
oneflow.save supports saving ddp Module objects directly. (https://github.com/Oneflow-Inc/oneflow/pull/8856)
Adds oneflow.utils.checkpoint to support Checkpointing optimization under eager. (https://github.com/Oneflow-Inc/oneflow/pull/9053)
With the newly added oneflow.mock_torch module and its mock method, a one-click migration to OneFlow can be achieved without changing the original import torch scripts. The benefit of this method is that all you need to do is add one line instead of modifying the imports of files one by one (https://github.com/Oneflow-Inc/oneflow/pull/9160 , https://github.com/Oneflow-Inc/oneflow/pull/9256 , https://github.com/Oneflow-Inc/oneflow/pull/9442 , https://github.com/Oneflow-Inc/oneflow/pull/9473). You can use it with the following code:
from oneflow.mock_torch import mock
mock()  # call before `import torch` so the import is redirected
import torch
# torch code
# ...
Supports mocks with scope, such as:
from oneflow.mock_torch import mock
with mock.enable():
    import torch
    # torch code
    # ...
Supports visual debugging of autograd's backward graph: when the environment variable ONEFLOW_DEBUG_MODE=1 is enabled, each backward computation writes the AutogradEngine execution graph as a dot file in the log directory. From it you can see the backward operators and their topology, which gives algorithm and R&D engineers an easy way to debug backward problems. (https://github.com/Oneflow-Inc/oneflow/pull/9412)
Published by jackalcooper over 2 years ago
OneFlow v0.8.0 came out, welcome to install the new version for a better experience.
This update contains 523 commits and the following highlights:
PyTorch-compatible APIs have been further optimized: 68 new APIs aligned with PyTorch have been added, and 84 operator and interface compatibility bugs have been fixed. More PyTorch models can now be migrated to OneFlow with one click.
All operators support Global Tensor more completely and efficiently, 28 Global Tensor-related bugs have been fixed, and 180 operator unit tests have been newly added.
Graph's advanced features have been further optimized:
In addition to the existing ZeRO-DP, the Zero Redundancy Optimizer (ZeRO) can now be combined with MP parallelism, 2D parallelism, and 3D parallelism, further reducing memory overhead.
Graph provides a new pipeline parallelism API, which not only simplifies the pipeline parallelism configuration but also improves the performance of pipeline parallelism and 3D parallelism.
Multi-dimensional debugging functionality has been added for the logic graph, light plan physical graph, memory analysis, Python stack information, and more, making Graph.debug more efficient.
Empowered by OneFlow v0.8.0 and LiBai v0.2.0, 3D parallelism training speed for GPT and BERT increases notably, and its training speed exceeds Megatron-LM under the same configuration in multiple dimensions. For more details, please click here.
OneEmbedding has been released recently. It is an extension component designed for large-scale recommendation systems, boasting high efficiency, extensibility, flexibility, and other advantages.
Multi-Device adaptation: OneFlow v0.8.0 provides a neat, efficient, and easily-extensible hardware abstraction layer called EP (Execution Provider) and defines a collection of basic computing interfaces called Primitive, allowing kernels to be re-implemented on top of the Primitive interface.
Added new debugging tool stacks: OneFlow-Profiler and AutoProf
OneFlow-Profiler is a tool designed to collect performance information during framework execution. It can record the execution time of operators and system components, the allocation of memory and DRAM, and the corresponding input and parameters of operators. The information can help developers find out the main source of overhead in framework execution and thus implement targeted optimization.
AutoProf is a framework designed to efficiently detect the alignment between OneFlow APIs and PyTorch APIs. Besides, it can automatically compare the performance results of OneFlow APIs and PyTorch APIs.
Significantly optimized the exception handling process in OneFlow API and improved the error message when APIs meet exceptions.
Significantly optimized the OneFlow API documentation: the API documentation has been restructured based on functionality. In addition to general operator APIs, oneflow.nn.graph, oneflow.embedding, oneflow.autograd, and other modules in OneFlow and their environment variables have also been explained in detail.
Outdated configuration method in OneFlow v0.7.0:
import oneflow as flow

zero_stage = 2  # the ZeRO stage chosen by the user (illustrative)

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.set_zero_redundancy_optimizer_mode("distributed_split")
        if zero_stage > 1:
            # stage 2
            flow.boxing.nccl.enable_use_compute_stream(True)
        if zero_stage > 2:
            # stage 3
            flow.boxing.nccl.disable_group_boxing_by_dst_parallel(True)

    def build(self, x):
        return self.linear(x)

graph = Graph()
New interface in OneFlow v0.8.0:
import oneflow as flow

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.enable_zero(stage=2)

    def build(self, x):
        return self.linear(x)

graph = Graph()
The parameter axis in oneflow.sbp.split() has been uniformly changed to dim to represent the slice dimension (the old name remains compatible). (https://github.com/Oneflow-Inc/oneflow/pull/8411)
v0.7.0
oneflow.sbp.split(axis=0)
v0.8.0
oneflow.sbp.split(dim=0)
In place of setting self.module_layer_0.config.stage_id = 0 (this method is no longer suggested), a new pipeline parallelism API config.set_stage has been added, which optimizes pipeline parallelism performance and avoids having to call input_tensor.to_global(placement=this_stage_placement) on all module input tensors at every stage. (https://github.com/Oneflow-Inc/oneflow/pull/8442)
v0.7.0
import oneflow as flow

B = [flow.sbp.broadcast]
P_0 = flow.placement(type="cuda", ranks=[0, 1])
P_1 = flow.placement(type="cuda", ranks=[2, 3])

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # Set each module's stage id to hint the graph to prepare the right number of buffers in the pipeline.
        self.m_stage0.config.stage_id = 0
        self.m_stage1.config.stage_id = 1
        self.config.set_gradient_accumulation_steps(4)

    def build(self, x):
        x = x.to_global(placement=P_0, sbp=B)
        y = self.m_stage0(x)
        # Move tensor between different pipeline stages.
        y = y.to_global(placement=P_1, sbp=B)
        z = self.m_stage1(y)
        return z
v0.8.0
class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # set_stage(stage_id, placement)
        # The stage id is numbered starting from 0 and increases by 1.
        # The placement is the placement of all tensors of this module.
        self.m_stage0.config.set_stage(stage_id=0, placement=P_0)
        self.m_stage1.config.set_stage(stage_id=1, placement=P_1)
        self.config.set_gradient_accumulation_steps(4)

    def build(self, x):
        # tensor.to_global(placement) is applied automatically to all input tensors of each module,
        # so there is no need to call to_global() in or around the module forward function.
        y = self.m_stage0(x)
        z = self.m_stage1(y)
        return z
Added new interfaces oneflow.env.init_rdma and oneflow.env.rdma_is_initialized to delay turning on RDMA, thus accelerating network communication across multiple devices (note: avoid using fork() after RDMA has been turned on; for example, a DataLoader with num_workers > 1 should be created before init_rdma). https://github.com/Oneflow-Inc/oneflow/pull/8415
Graph provides a new algorithm-optimization interface graph.config.enable_straighten_algorithm to optimize the execution order in the computation graph, maximizing the overlap between data transfer and computation. With this interface, data transfer speed rises by 0.6% in data parallelism mode and 6% in model parallelism mode. (https://github.com/Oneflow-Inc/oneflow/pull/8347, https://github.com/Oneflow-Inc/oneflow/pull/8483, https://github.com/Oneflow-Inc/oneflow/pull/8495 )
Optimized the implementation of clip grad in Graph to support clip_grad_max_norm > 1.0 and provided a configurable clip_grad_norm_type, which previously could only be set to 2 but now can be set to +/- inf, +/- 1, +/- 2, +/- 3, and larger p-norm values. See the reference here (https://github.com/Oneflow-Inc/oneflow/pull/7548)
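For intuition, the total-p-norm computation behind these options can be sketched with NumPy (an illustrative sketch only, not OneFlow's implementation; `total_norm` and `clip_grad_norm_` are hypothetical names):

```python
import numpy as np

def total_norm(grads, p=2.0):
    """Total p-norm over a list of gradient arrays (illustrative)."""
    if p == float("inf"):
        return max(np.abs(g).max() for g in grads)
    return sum(np.sum(np.abs(g) ** p) for g in grads) ** (1.0 / p)

def clip_grad_norm_(grads, max_norm, p=2.0):
    """Scale grads in place so their total p-norm is at most max_norm."""
    norm = total_norm(grads, p)
    if norm > max_norm:
        for g in grads:
            g *= max_norm / norm
    return norm
```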
Global tensors in Graph support the tensor.set_item operation for invariable ops, for example mask[:, :len_keep] = 0 (https://github.com/Oneflow-Inc/oneflow/pull/7751)
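The slice-assignment semantics are the familiar NumPy ones; the example above behaves like this local sketch:

```python
import numpy as np

len_keep = 2
mask = np.ones((2, 4))
mask[:, :len_keep] = 0  # zero out the first len_keep columns of every row
```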
Graph exports build_graph and compile_and_init_runtime interfaces, allowing user-defined passes to be compiled after building the graph, thus rewriting and optimizing the graph. The two interfaces also allow Graph to restore an external graph (job). (https://github.com/Oneflow-Inc/oneflow/pull/8168)
Added the RegisterJobPass interface to support rewriting the graph with self-defined external job passes. (https://github.com/Oneflow-Inc/oneflow/pull/8370)
oneflow.boxing.nccl.enable_use_compute_stream(True) gained optimized support for NCCL logical kernels:
Added a noncontiguous ReduceScatter kernel to support the conversion P -> S(i), (i > 0) (https://github.com/Oneflow-Inc/oneflow/pull/8361)
Supported the conversion B -> S (https://github.com/Oneflow-Inc/oneflow/pull/8355)
Enabled nccl send/recv primitives to support special SBP conversions (https://github.com/Oneflow-Inc/oneflow/pull/8318)
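In SBP terms, the P -> S(i) conversion is a ReduceScatter: every rank holds a partial sum, and afterwards each rank holds one slice of the reduced result. A minimal NumPy simulation of the semantics (illustrative only; the real kernel uses NCCL and runs across processes):

```python
import numpy as np

def reduce_scatter(partials, split_dim=0):
    """Simulate P -> S(split_dim): sum the per-rank partial tensors (reduce),
    then hand rank r the r-th slice along split_dim (scatter)."""
    total = np.sum(partials, axis=0)                       # reduce
    return np.split(total, len(partials), axis=split_dim)  # scatter
```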
Added the efficient fused kernel oneflow.nn.FusedMLP, which is controlled by export ONEFLOW_FUNCTOR_DISABLE_FUSED_MLP=0 (https://github.com/Oneflow-Inc/oneflow/pull/7391, https://github.com/Oneflow-Inc/oneflow/pull/8165, https://github.com/Oneflow-Inc/oneflow/pull/8217, https://github.com/Oneflow-Inc/oneflow/pull/8413)
Graph.debug offers a new parameter max_stack_depth (default 2) to denote the maximal depth of the Python stack where each op in Graph was created, making it convenient to locate the Python context of each op. (https://github.com/Oneflow-Inc/oneflow/pull/8028)
Apart from printing the input/output/variable info of modules in Graph, it also newly supports printing the info of operators constructed in module forward. (https://github.com/Oneflow-Inc/oneflow/pull/8135)
Enabled export ONEFLOW_DEBUG_MODE=true and export GLOG_v=3 to print the full memory log, which contains multi-level MemBlock info on each device (Total Memory -> Chunk -> MemBlock), Blocks with exclusive memory, Eager Variables, and other information. Besides, a lifecycle label was added in Regst to analyze each tensor's memory lifecycle.
LightPlan provides a more simplified way to display the Actor Graph, cutting down the cost of debugging based on Plan. When ONEFLOW_DEBUG_MODE=true, a series of light plan files corresponding to each rank of a Graph are generated under the log/local_rank_0/machine/ directory, containing the simplified actor sub-graph of each rank; the filename is GraphName_rank_i_light_plan. (https://github.com/Oneflow-Inc/oneflow/pull/8396)
The print(graph) method can display the logic graph by Module, making debugging during graph construction more efficient. (https://github.com/Oneflow-Inc/oneflow/pull/8131)
Supported passing extra parameters when Optimizer ParamGroup is being built, meeting other special operation demands for LrScheduler. (https://github.com/Oneflow-Inc/oneflow/pull/7753)
param_groups = [{"params": model.parameters(), "excess_param": ...}]
optim = flow.optim.Adam(param_groups, lr=0.1)
Added the oneflow.cuda.current_device interface to return the device index of the current rank (https://github.com/Oneflow-Inc/oneflow/pull/7856)
Added the oneflow.utils.from_torch interface to convert a PyTorch Tensor into a OneFlow Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7851)
Added the oneflow.utils.to_torch interface to convert a OneFlow Tensor into a PyTorch Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7851)
Added the oneflow.cuda.empty_cache interface to manually release memory (https://github.com/Oneflow-Inc/oneflow/pull/8482)
Added the oneflow.roc_auc_score interface on CPU, which is equivalent to sklearn.metrics.roc_auc_score (https://github.com/Oneflow-Inc/oneflow/pull/7951)
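The metric itself can be reproduced with a small rank-based NumPy computation (an illustrative sketch of the math, not OneFlow's kernel; tie handling is omitted for brevity):

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: the probability that a random
    positive example is scored higher than a random negative one."""
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # 1-based ranks of the scores
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```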
Provided the Tensor.contiguous_ interface as the inplace version of the contiguous operation (https://github.com/Oneflow-Inc/oneflow/pull/8275)
Added the Tensor.local_to_global and Tensor.global_to_global interfaces to separately implement different default check-meta operations (https://github.com/Oneflow-Inc/oneflow/pull/8027)
Global Tensor's Slice/SliceUpdate supported all nd_sbp inputs, and SliceUpdate fully supported the inplace operation and backpropagation (https://github.com/Oneflow-Inc/oneflow/pull/8313, https://github.com/Oneflow-Inc/oneflow/pull/8337, https://github.com/Oneflow-Inc/oneflow/pull/8344, https://github.com/Oneflow-Inc/oneflow/pull/8416)
Eager Global Tensor supported balanced-splitter nd-sbp eager boxing (https://github.com/Oneflow-Inc/oneflow/pull/7768)
Supported executing Eager Slice Boxing on arbitrary devices, including non-CPU devices and non-CUDA-capable devices (https://github.com/Oneflow-Inc/oneflow/pull/8180)
For better recommendations, modern recommendation systems always rely on huge Embedding tables. Besides, frequent iterations of user data require model training to be fast enough.
OneEmbedding is a component designed for large-scale recommendation systems, and it's efficient, extensible, and highly flexible. The following are its advantages:
Hierarchical storage and dynamic capacity expansion: users can expand the capacity of the Embedding at much lower cost.
Mixed parallelism strategy: it supports easily extending the model to train it on multi-machine multi-GPU.
Embedding quantization for better communication: in the parallel scenario, communication data can be quantized to reduce the communication amount, thus accelerating the training.
Efficient data pipeline: the model parts that have no data dependency can be executed in advance, thus overlapping with other operations in time.
Automatic mixed precision training: data can be computed in FP16 to reduce the occupied memory, thus accelerating the training speed and ensuring high model convergence precision.
A collection of efficient CUDA ops for common operations in recommendation systems is available.
Flexible model building is supported.
See OneEmbedding API documentation from here.
A collection of new functionalities and interfaces that are compatible with PyTorch 1.10.0 have been added.
Added the Tensor.pin_memory functionality, which supports placing a tensor in pinned memory when it is created. (https://github.com/Oneflow-Inc/oneflow/pull/8073)
Supported passing the pin_memory parameter when a tensor is being created. (https://github.com/Oneflow-Inc/oneflow/pull/8176)
DataLoader supported pin_memory (https://github.com/Oneflow-Inc/oneflow/pull/8214)
Added the Tensor.is_pinned attribute (https://github.com/Oneflow-Inc/oneflow/pull/8447)
Added the ~Tensor (invert) method to perform a logical NOT on each element of a tensor with dtype bool. (https://github.com/Oneflow-Inc/oneflow/pull/7899)
Added the Tensor.log2 method to compute log2 of each element. (https://github.com/Oneflow-Inc/oneflow/pull/7906)
Added the Tensor.new_zeros method to generate a new zero-filled tensor. (https://github.com/Oneflow-Inc/oneflow/pull/7937)
Added the oneflow.as_tensor interface to convert the input data into a tensor that shares its data. (https://github.com/Oneflow-Inc/oneflow/pull/7855)
Added the Tensor.__array__ method, so that np.array can take a oneflow tensor as input to construct an np.ndarray object. (https://github.com/Oneflow-Inc/oneflow/pull/7970)
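The mechanism behind this is NumPy's __array__ protocol: np.array consults the object's __array__ method. A toy illustration with a hypothetical class (not OneFlow's tensor):

```python
import numpy as np

class ToyTensor:
    """Any object with __array__ can be consumed by np.array/np.asarray."""
    def __init__(self, data):
        self._data = list(data)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this to obtain an ndarray view/copy of the object.
        arr = np.array(self._data)
        return arr.astype(dtype) if dtype is not None else arr

t = ToyTensor([1, 2, 3])
a = np.array(t)  # dispatches to t.__array__() under the hood
```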
Added the Tensor.new_tensor method to copy the input data into a new tensor. (https://github.com/Oneflow-Inc/oneflow/pull/7973)
Added the Tensor.half method, which is equivalent to tensor.to(oneflow.float16). (https://github.com/Oneflow-Inc/oneflow/pull/7971)
Added the Tensor.byte method to generate a new uint8 tensor; tensor.byte() is equivalent to tensor.to(oneflow.uint8). (https://github.com/Oneflow-Inc/oneflow/pull/8053)
Added the Tensor.view_as and Tensor.new_empty methods (https://github.com/Oneflow-Inc/oneflow/pull/8077)
Added the Tensor.type method to implement the corresponding casts and add the oneflow(.cuda).{Byte, Char, Short, Int, Long, Half, Float, Double}Tensor objects (https://github.com/Oneflow-Inc/oneflow/pull/8129)
Added the Tensor.dot method to compute the dot product of two 1D tensors; it is equivalent to oneflow.dot. (https://github.com/Oneflow-Inc/oneflow/pull/8520)
Added the oneflow.nn.init.orthogonal_ interface to initialize tensors (https://github.com/Oneflow-Inc/oneflow/pull/8009)
Added the oneflow.nn.Softshrink op (https://github.com/Oneflow-Inc/oneflow/pull/7826)
Added the oneflow.nn.Threshold op (https://github.com/Oneflow-Inc/oneflow/pull/7875)
Added the oneflow.nn.Hardshrink activation function (https://github.com/Oneflow-Inc/oneflow/pull/7887)
Added the oneflow.isnan and oneflow.isinf interfaces to decide whether each element of a tensor is nan or inf (https://github.com/Oneflow-Inc/oneflow/pull/7943)
The oneflow.nn.functional.* interfaces support passing numpy scalar parameters (https://github.com/Oneflow-Inc/oneflow/pull/7935)
Added the oneflow.nn.functional.cosine_similarity op to calculate the cosine similarity of two tensors (https://github.com/Oneflow-Inc/oneflow/pull/8119)
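The operation it computes can be sketched in NumPy (an illustrative sketch; the function name and eps handling here are just for the sketch):

```python
import numpy as np

def cosine_similarity(x1, x2, dim=-1, eps=1e-8):
    # dot(x1, x2) / (max(||x1||, eps) * max(||x2||, eps)) along `dim`
    dot = (x1 * x2).sum(axis=dim)
    n1 = np.maximum(np.linalg.norm(x1, axis=dim), eps)
    n2 = np.maximum(np.linalg.norm(x2, axis=dim), eps)
    return dot / (n1 * n2)
```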
Added the oneflow.nn.functional.conv_transpose1d, oneflow.nn.functional.conv_transpose2d, and oneflow.nn.functional.conv_transpose3d ops (https://github.com/Oneflow-Inc/oneflow/pull/7991)
Added the oneflow.unbind interface to return a tuple of all slices along a given dimension (https://github.com/Oneflow-Inc/oneflow/pull/7730)
Added the oneflow.swapdims interface to swap two specified dimensions; oneflow.swapdims is equivalent to NumPy's swapaxes. (https://github.com/Oneflow-Inc/oneflow/pull/7659)
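Its semantics match NumPy's swapaxes, shown here with NumPy so the expected shape is easy to check:

```python
import numpy as np

x = np.zeros((2, 3, 4))
y = np.swapaxes(x, 0, 2)  # what oneflow.swapdims(x, 0, 2) computes
```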
Added the oneflow.addcmul op to execute the element-wise composite function out = input + value × tensor1 × tensor2 (https://github.com/Oneflow-Inc/oneflow/pull/7282)
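The element-wise formula can be checked with a NumPy sketch (a hypothetical helper, not OneFlow's kernel):

```python
import numpy as np

def addcmul(input, tensor1, tensor2, value=1.0):
    # out = input + value * tensor1 * tensor2, element-wise
    return input + value * tensor1 * tensor2
```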
Added the oneflow.searchsorted op (https://github.com/Oneflow-Inc/oneflow/pull/7949)
Added the oneflow.mm op (https://github.com/Oneflow-Inc/oneflow/pull/8440)
Added the oneflow.tensordot interface and offered a collection of equivalent transformation cases (https://github.com/Oneflow-Inc/oneflow/pull/7968)
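Two typical equivalent transformations, shown with NumPy's tensordot (whose semantics oneflow.tensordot mirrors):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((3, 4))
b = rng.random((4, 5))
mm = np.tensordot(a, b, axes=1)     # contract one axis: plain matrix multiplication
inner = np.tensordot(a, a, axes=2)  # contract both axes: Frobenius inner product
```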
Added the oneflow.repeat_interleave op to repeat the elements of a tensor; this op is equivalent to numpy.repeat (https://github.com/Oneflow-Inc/oneflow/pull/8324)
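Its behavior matches numpy.repeat, e.g.:

```python
import numpy as np

x = np.array([1, 2, 3])
r = np.repeat(x, 2)  # what oneflow.repeat_interleave(x, 2) computes
```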
Added the oneflow.amax and Tensor.amax methods (https://github.com/Oneflow-Inc/oneflow/pull/7996)
Added the oneflow.median and Tensor.median methods (https://github.com/Oneflow-Inc/oneflow/pull/8069)
Added the oneflow.normal method and fixed the Tensor.normal method (https://github.com/Oneflow-Inc/oneflow/pull/7956)
Added the oneflow.amin and Tensor.amin methods (https://github.com/Oneflow-Inc/oneflow/pull/8042)
Added the oneflow.mv op and Tensor.mv method (https://github.com/Oneflow-Inc/oneflow/pull/8445)
Added oneflow.cuda.manual_seed, oneflow.cuda.manual_seed_all, oneflow.seed, oneflow.manual_seed, oneflow.initial_seed, oneflow.get_rng_state, and oneflow.set_rng_state, and improved the configuration of OneFlow random seed initialization. (https://github.com/Oneflow-Inc/oneflow/pull/7957 )
Added new interfaces oneflow.set_grad_enabled and oneflow.enable_grad to enable or disable automatic gradient computation for some subgraphs. (https://github.com/Oneflow-Inc/oneflow/pull/8016)
Supported an upstream gradient dtype in the autograd backward operator that differs from that of the input. (https://github.com/Oneflow-Inc/oneflow/pull/8233, https://github.com/Oneflow-Inc/oneflow/pull/8309)
Supported backward operators that do not capture any tensor executing the backward computation multiple times. (https://github.com/Oneflow-Inc/oneflow/pull/8031)
Added oneflow.cuda.set_device and oneflow.cuda.synchronize. (https://github.com/Oneflow-Inc/oneflow/pull/8322)
Refactored the RNN modules and migrated the layer-splicing implementation from Python to C++, which greatly optimized performance. Added RNNCell-related modules and modules aligned with torch.nn.utils.rnn in functionality:
RNN, LSTM, and GRU
RNNCell, LSTMCell, GRUCell, and oneflow.nn.utils.rnn
Supported heterogeneous device types: to cope with the complexity of different hardware, OneFlow, following the dependency inversion principle in software engineering, has introduced a hardware abstraction layer called Execution Provider (EP). The hardware abstraction layer is composed of a series of interfaces abstracted from the capabilities the framework requires of hardware devices at runtime. Once the hardware abstraction layer is in place, each module uses the underlying hardware through the interfaces of the abstraction layer rather than the original hardware interfaces, so modules need not concern themselves with the specific details of the hardware. When a new hardware device is introduced, because the hardware abstraction interfaces remain unchanged, all modules can adapt to the new device without any modification. Likewise, when adapting new hardware for the framework, there is no need to understand the framework's implementation details: one only needs to implement the series of interfaces according to the contract of the hardware abstraction interface and the actual situation of the hardware device, and the hardware adaptation is complete.
Execution Provider has defined a collection of runtime interfaces: device registration interface, device management interface, queue management interface, event management interface, and memory management interface.
In addition to the runtime interfaces, the Execution Provider also defines a set of computing interfaces called Primitive, which describe the computations commonly used in a deep learning framework, thus simplifying operator development during hardware adaptation. Compared with the runtime interfaces, the interfaces provided by Primitive are looser and more flexible: all interfaces are mutually independent, and each represents a specific computing capability provided by a certain hardware device. Like the runtime interfaces, the Primitive interfaces are abstracted close to the device side, so developers can carry out adaptation work without an in-depth understanding of OneFlow's mechanisms. Developers must implement all interfaces provided by the Execution Provider when adapting the runtime interfaces, but when adapting Primitive they can adapt selectively according to the actual situation of the project.
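The split between runtime interfaces and Primitive can be sketched in Python (an illustrative sketch only; OneFlow's actual EP interfaces are C++ classes with different names and many more methods):

```python
from abc import ABC, abstractmethod

class Device(ABC):
    """Runtime side: what the framework requires of any device."""
    @abstractmethod
    def malloc(self, size): ...
    @abstractmethod
    def free(self, ptr): ...

class Primitive(ABC):
    """Compute side: one independent capability, e.g. an element-wise Add."""
    @abstractmethod
    def launch(self, *args): ...

class CpuAdd(Primitive):
    """A backend implements only the primitives it chooses to support."""
    def launch(self, a, b):
        return [x + y for x, y in zip(a, b)]
```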
Added unit tests for the basic functions of ep::primitive (https://github.com/Oneflow-Inc/oneflow/pull/8099)
Added ep::primitive::constant_pad, optimized its performance, removed the obsolete pad grad, and used pad as the inverse of pad (https://github.com/Oneflow-Inc/oneflow/pull/8152)
Used unary primitive interface instead of original implementation in Kernel (https://github.com/Oneflow-Inc/oneflow/pull/8270)
Added environment variable ONEFLOW_EP_CUDA_CUBLAS_WORKSPACE_SIZE_MB to configure cublas workspace size (https://github.com/Oneflow-Inc/oneflow/pull/8478)
Scalar logical kernel supported primitives (https://github.com/Oneflow-Inc/oneflow/pull/8531)
Used primitives to implement logical not kernel (https://github.com/Oneflow-Inc/oneflow/pull/8544)
Migrated all activation kernels to use primitive (https://github.com/Oneflow-Inc/oneflow/pull/8300)
Bias add kernel supported primitive (https://github.com/Oneflow-Inc/oneflow/pull/8512)
Decoupled OneDNN from the ep::primitive CPU device and provided the environment variable ONEFLOW_ENABLE_ONEDNN_OPTS to let OneDNN accelerate the CPU primitive interfaces (https://github.com/Oneflow-Inc/oneflow/pull/8274)
Saved the log independently for each rank to log/local_rank_{i} when launching multiple processes with the launcher. (https://github.com/Oneflow-Inc/oneflow/pull/7825)
Optimized the display of OF_PROFILER_RANGE_GUARD in nsys. (https://github.com/Oneflow-Inc/oneflow/pull/8121)
OneFlow-Profiler is designed to collect various performance-related information during the execution flow of the framework. It can calculate the execution time of the operator or system components, the allocation of memory and DRAM, and can record the input and parameter information corresponding to the operator. This information can be used by developers to analyze which part brings the most overhead and implement some targeted optimizations.
Added OneFlow-Profiler. (https://github.com/Oneflow-Inc/oneflow/pull/8047)
Profiled the information of the CUDA operator. (https://github.com/Oneflow-Inc/oneflow/pull/8195)
Profiled the bandwidth information of the operator. (https://github.com/Oneflow-Inc/oneflow/pull/8254)
Added interfaces to collect bandwidth information and optimized code implementation. (https://github.com/Oneflow-Inc/oneflow/pull/8332)
Refined Profiler. (https://github.com/Oneflow-Inc/oneflow/pull/8332)
Used Kineto and CUPTI to profile the information of CUDA operator. (https://github.com/Oneflow-Inc/oneflow/pull/8417)
AutoProf is a framework designed to test the performance of OneFlow and PyTorch operators. It can automatically test operator performance and print a comparison table under different CPU thread counts and GPUs. At present, it has been applied to the development of some existing operators and all new operators. Its effect is shown below:
Added auto speed comparison framework of operator AutoProf to automatically run op to test: (https://github.com/Oneflow-Inc/oneflow/pull/8207)
The speed of OneFlow and PyTorch.
The speed of CPU/GPU Kernel under different numbers of threads.
Total end-to-end time with CPU Kernel.
Optimized the display of AutoProf to save testing time. (https://github.com/Oneflow-Inc/oneflow/pull/8303)
Supported API tests without actual kernel execution, and the time would be end2end. (https://github.com/Oneflow-Inc/oneflow/pull/8320)
Supported AutoProf to measure kernel bandwidth. (https://github.com/Oneflow-Inc/oneflow/pull/8367)
Added a pass to remove redundant Cast ops. (https://github.com/Oneflow-Inc/oneflow/pull/7837 )
Used MLIR to implement constant folding and the fused optimization of Conv and BN. (https://github.com/Oneflow-Inc/oneflow/pull/7799)
Optimized constant folding in OneFlow C++ API. (https://github.com/Oneflow-Inc/oneflow/pull/8124)
Provided fault tolerance checking for parsed module. (https://github.com/Oneflow-Inc/oneflow/pull/8299)
Fixed a bug in the constant folding unit test. (https://github.com/Oneflow-Inc/oneflow/pull/8340)
Supported IREE. (https://github.com/Oneflow-Inc/oneflow/pull/8249)
Added oneflow_iree (python) to CI. (https://github.com/Oneflow-Inc/oneflow/pull/8431)
Removed redundant output_lbns in IR. (https://github.com/Oneflow-Inc/oneflow/pull/8409)
Provided a conversion marker for Variable -> constant. (https://github.com/Oneflow-Inc/oneflow/pull/8412)
Removed hardcoded properties in IR. (https://github.com/Oneflow-Inc/oneflow/pull/8420)
Implemented the AutoNHWC pass and provided the environment variable ONEFLOW_MLIR_PREFER_NHWC. It supports automatically converting common networks' data formats to channels-last, which yields a noticeable acceleration on NVIDIA GPUs that support FP16. (https://github.com/Oneflow-Inc/oneflow/pull/7890)
Optimized the speed and memory of GPT and BERT under 3-D parallelism:
Performance optimization: the fused_scale_mask_softmax operator supports broadcast input; optimized the kernel implementation and performance of softmax for specific column counts (1024); completed the previously incomplete GetSbp list of the fused_scale_mask_softmax backward operator. (https://github.com/Oneflow-Inc/oneflow/pull/8321)
Communication optimization: Optimized the SBP communication cost under B->S, B->B, and B->P. (https://github.com/Oneflow-Inc/oneflow/pull/8378)
Interface optimization: Optimized the inefficient edge connection problem caused by the misalignment of stage id and to_global sequence dependency when using pipeline stage. (https://github.com/Oneflow-Inc/oneflow/pull/8442)
Communication optimization: nccl_use_compute_stream supports more comprehensive sbp conversions like P -> S(i). (https://github.com/Oneflow-Inc/oneflow/pull/8361)
Communication optimization: Parallel use of RDMA communication. (https://github.com/Oneflow-Inc/oneflow/pull/8415)
Memory optimization: Eliminated the randomness of the memory reuse algorithm, so that the memory reuse result on every rank is identical when the subgraphs are the same, avoiding pathological cases. (https://github.com/Oneflow-Inc/oneflow/pull/8441)
Memory optimization: Removed the extra buffer problem of Stage 0 CPU copy under Pipeline parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/8484)
Memory optimization: Under Checkpointing and Pipeline, the input identity of the module was de-duplicated to reduce additional Checkpointing tensor, and added the block name prefix of the module to the identity. (https://github.com/Oneflow-Inc/oneflow/pull/8509)
Combination Optimization: ZeRO-DP supported using with Pipeline parallel and 3-D parallel. (https://github.com/Oneflow-Inc/oneflow/pull/8464)
Provided new environment-variable optimization switches ONEFLOW_ENABLE_MULTI_TENSOR_MODEL_UPDATE and ONEFLOW_FUSE_MODEL_UPDATE_CAST. Under AMP, they support fusing the Optimizer model-update kernel with the next round's forward cast operators. (https://github.com/Oneflow-Inc/oneflow/pull/8373)
Enabled export ONEFLOW_EAGER_LOCAL_TO_GLOBAL_BALANCED_OVERRIDE=true to accelerate Eager Global execution by skipping the synchronization of meta information across the ranks of a Global Tensor (for use when users are confident that their code execution is symmetric, SPMD). (https://github.com/Oneflow-Inc/oneflow/pull/7981)
This environment variable indicates whether the shape of the input data is the same on every rank when local to global is executed. If it is set to true, there is no need to synchronize the shapes across ranks, and the logical shape is calculated locally.
Used the Python C API instead of pybind11 to optimize the calling speed of tensor and functional APIs.
Optimized functional return types to save overhead and avoid reference copies, and fixed a bug where the inplace tensor id could be inconsistent. (https://github.com/Oneflow-Inc/oneflow/pull/7985)
Moved the tensor API from pybind11 to the C Python API, added a tensor hash function, and resolved function naming conflicts. (https://github.com/Oneflow-Inc/oneflow/pull/8258, https://github.com/Oneflow-Inc/oneflow/pull/8315, https://github.com/Oneflow-Inc/oneflow/pull/8342, https://github.com/Oneflow-Inc/oneflow/pull/8375)
Performance optimization: Let vm worker threads concentrate on computing tasks, and decoupled memory tasks from computing tasks. (https://github.com/Oneflow-Inc/oneflow/pull/7976)
Optimized the speed of operations in DataLoader, including MakeLocalTensorFromData, which is 20% faster under the swin-T dataloader. (https://github.com/Oneflow-Inc/oneflow/pull/8066)
Optimized the global sparse_softmax_cross_entropy kernel. (https://github.com/Oneflow-Inc/oneflow/pull/7298)
Optimized and sped up the CPU permute kernel with OneDNN. (https://github.com/Oneflow-Inc/oneflow/pull/7872)
Optimized and sped up the CPU softmax kernel with OneDNN. (https://github.com/Oneflow-Inc/oneflow/pull/8071 , https://github.com/Oneflow-Inc/oneflow/pull/8075)
Optimized the memory usage and speed of the backward computation of the pooling kernel. (https://github.com/Oneflow-Inc/oneflow/pull/7980)
Optimized Slice and Tensor getitem operations based on View to improve the speed of dataloader. (https://github.com/Oneflow-Inc/oneflow/pull/8148, https://github.com/Oneflow-Inc/oneflow/pull/8211, https://github.com/Oneflow-Inc/oneflow/pull/8243)
Optimized the backward composition logic of `flip` and `cumsum`, and removed some grad operators. When testing grad diffs, used random-value tests to increase test robustness. (https://github.com/Oneflow-Inc/oneflow/pull/8155)
Optimized the memory usage of the `NormalizationAddReluGrad` operator and added versions that do not require addend_diff. (https://github.com/Oneflow-Inc/oneflow/pull/8213)
Optimized and sped up `tensor.reshape` and `tensor.reshape_as` by moving them from Python to C++. (https://github.com/Oneflow-Inc/oneflow/pull/8304)
Converted `tensor.view`, `tensor.view_as`, `tensor.permute`, `tensor.transpose`, and `tensor.contiguous_` from Python implementations to C++ implementations. (https://github.com/Oneflow-Inc/oneflow/pull/8317)
Greatly optimized the performance of `index_select` and `repeat_interleave` by using gather to replace dim gather. (https://github.com/Oneflow-Inc/oneflow/pull/8360)
Optimized and removed temporary memory in cumprod cpu grad kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8369)
The `embedding` operator now supports AMP; improved its performance on the normal path and fixed a memory out-of-bounds bug in the gather CPU kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8374)
Optimized the performance of `Tensor.fill_`. (https://github.com/Oneflow-Inc/oneflow/pull/8283)
Greatly optimized the backward performance of the broadcast element-wise binary family of operators. (https://github.com/Oneflow-Inc/oneflow/pull/8339)
Added fusion operator BinaryCrossEntropyWithLogitsReduceMean. (https://github.com/Oneflow-Inc/oneflow/pull/8476)
Added high-performance matrix multiplication Fused kernel based on cublasLt. (https://github.com/Oneflow-Inc/oneflow/pull/8462, https://github.com/Oneflow-Inc/oneflow/pull/8222, https://github.com/Oneflow-Inc/oneflow/pull/8063)
Exported oneflow env to python and used python's objects to manage its lifecycle. (https://github.com/Oneflow-Inc/oneflow/pull/7792)
Used Python's reference counting to control the life cycle of Graph and constructed strict and rich destruction test cases. (https://github.com/Oneflow-Inc/oneflow/pull/7857)
Supported recycling independent threads that can no longer be reused when Graph is destructed. (https://github.com/Oneflow-Inc/oneflow/pull/7862)
Changed the basic configuration of resource from one-time static effect to real-time effect. (https://github.com/Oneflow-Inc/oneflow/pull/8444)
Consolidated the nccl_comm dynamically created by the Graph NCCL logical kernel into the runtime for initial creation to avoid the deadlock caused by the inconsistency between the creation order of each rank and the eager nccl comm creation order. (https://github.com/Oneflow-Inc/oneflow/pull/8263)
Refactor optimization: Merged `nn.graph.util.IONode` and `nn.graph.util.IONodeType` into IOArgs. (https://github.com/Oneflow-Inc/oneflow/pull/8272)
Refactor optimization: Renamed the global singleton Global object to the Singleton object. (https://github.com/Oneflow-Inc/oneflow/pull/8490)
Refactor optimization: Removed gpu_device_num (https://github.com/Oneflow-Inc/oneflow/pull/8516)
Refactor optimization: Removed outdated AvailableMemDesc concepts. (https://github.com/Oneflow-Inc/oneflow/pull/8145)
Refactor optimization: Removed outdated Model IO Kernel logic. (https://github.com/Oneflow-Inc/oneflow/pull/8151)
Refactor optimization: Replaced GpuDeviceNum with the actual number of devices to avoid coupling with specific device types. (https://github.com/Oneflow-Inc/oneflow/pull/8166)
Added a C++ interface to manually trigger allocator GC on each stream (applicable to ZeRO). (https://github.com/Oneflow-Inc/oneflow/pull/8452)
The execution of Eager VirtualMachine instruction is based on the execution of EP. (https://github.com/Oneflow-Inc/oneflow/pull/7923)
Optimized and removed all redundant `Get(Ptr)OrThrow` interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/7812)
Added a validity check for `flow.save(global_dst_rank)`. (https://github.com/Oneflow-Inc/oneflow/pull/7964)
Supported the backward function node to run multiple times if it does not capture any tensor. (https://github.com/Oneflow-Inc/oneflow/pull/8031)
Added the `ThreadLocalCached` decorator to clear caches in time and alleviate growing memory usage. (https://github.com/Oneflow-Inc/oneflow/pull/7858)
Added C++14 implementations of `std::inclusive_scan`/`std::exclusive_scan`. (https://github.com/Oneflow-Inc/oneflow/pull/8128)
Packaged the parameters required by the eager opkernel and passed them per thread to solve some thread-safety problems. (https://github.com/Oneflow-Inc/oneflow/pull/7617)
Eager Stream supports kernel computation on pinned memory. (https://github.com/Oneflow-Inc/oneflow/pull/8486)
Introduced a tool class for dim range check to replace simplified Functor's various checking logic for dimensions. (https://github.com/Oneflow-Inc/oneflow/pull/8382)
Refactoring and optimization: removed the Blob object in EagerBlobObject, which caused redundant TensorView instructions. To support ShapeView efficiently, the elem_cnt attribute has also been removed. (https://github.com/Oneflow-Inc/oneflow/pull/7895)
Refactoring and optimization: extracted the algorithm used by BinAllocator to share dynamic memory pools.
Refactoring and optimization: the `VectorAt` and `MapAt` functions now uniformly pass parameters by reference, resolving the mixed use of reference and pointer interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/8191)
Refactoring and optimization: removed the cfg application on C++. (https://github.com/Oneflow-Inc/oneflow/pull/8158)
Refactoring and optimization: removed the outdated code related to RemoteBlob in Single-Client. (https://github.com/Oneflow-Inc/oneflow/pull/8228)
Refactoring and optimization: merged duplicate logic in eager boxing ccl and nccl boxing expr. (https://github.com/Oneflow-Inc/oneflow/pull/7930)
Refactoring and optimization: removed cfg on Python and reduced the number of symbols to optimize the link speed of compilation.
Refactoring and optimization: merged `symbol::IdCache` and `symbol::Storage`. (https://github.com/Oneflow-Inc/oneflow/pull/8331)
Refactoring and optimization: introduced `llvm::SmallVector` and used `oneflow::small_vector` instead of `fixed_vector`. Besides, we have optimized the implementation and usage of Shape and Stride. (https://github.com/Oneflow-Inc/oneflow/pull/8365 , https://github.com/Oneflow-Inc/oneflow/pull/8402)
Refactoring and optimization: refactored ShapeView and Shape to eliminate duplication and inconsistencies. (https://github.com/Oneflow-Inc/oneflow/pull/8422)
Refactoring and optimization: eager VirtualMachine has decoupled InstructionType's dependency on StreamType. (https://github.com/Oneflow-Inc/oneflow/pull/7607)
Refactoring and optimization: removed the InstructionMsg class and merged all its functions and fields into the Instruction class. (https://github.com/Oneflow-Inc/oneflow/pull/7623)
Stride support:
Tensor, UserOp, and UserKernel in `user_op::` all support the stride attribute. (https://github.com/Oneflow-Inc/oneflow/pull/7829)
`cast` supports stride. (https://github.com/Oneflow-Inc/oneflow/pull/8292)
View support and optimization:
Op definitions now take a flag indicating whether non-contiguous input tensors are supported. Besides, the following non-contiguous view ops are now supported: `transpose`, `permute`, `narrow`, `expand`, `expand_as`, `split`, `chunk`, `unfold_tensor`, `movedim`, `as_strided`, `select`, `swapaxes`, `T`, `t`, `hsplit`, `vsplit`, `tensor_split`. (https://github.com/Oneflow-Inc/oneflow/pull/7813)
Tensor slice uses view operations by default. (https://github.com/Oneflow-Inc/oneflow/pull/8302)
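The mechanism behind these view ops can be illustrated with a toy strided view in plain Python (this class is illustrative, not OneFlow internals): a view shares the underlying buffer and only changes shape/stride metadata, so no data is copied.

```python
class StridedView:
    """A minimal strided 'tensor' view over a flat buffer."""

    def __init__(self, buf, shape, strides):
        self.buf, self.shape, self.strides = buf, shape, strides

    def __getitem__(self, idx):  # idx is a tuple of coordinates
        offset = sum(i * s for i, s in zip(idx, self.strides))
        return self.buf[offset]

    def transpose(self):
        # Swap metadata only; the buffer is shared with the new view,
        # which is why the result is non-contiguous rather than a copy.
        return StridedView(self.buf, self.shape[::-1], self.strides[::-1])

# A 2x3 row-major "tensor" over [0, 1, 2, 3, 4, 5]: strides are (3, 1).
t = StridedView(list(range(6)), (2, 3), (3, 1))
tt = t.transpose()                 # 3x2 view with strides (1, 3)
assert tt[(2, 1)] == t[(1, 2)] == 5
assert tt.buf is t.buf             # no copy was made
```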
Automatically generated version status (Feature Stage) for OneFlow's API. (https://github.com/Oneflow-Inc/oneflow/pull/7945)
Optimized CUDA memset to use `cudaMemsetAsync`. (https://github.com/Oneflow-Inc/oneflow/pull/7763)
`LeakyReLU` supports inplace optimization. (https://github.com/Oneflow-Inc/oneflow/pull/8060)
Added the following parameters to the `nn.Embedding` interface: `padding_idx`, `max_norm`, `norm_type`, `scale_grad_by_freq`. (https://github.com/Oneflow-Inc/oneflow/pull/8110)
Aligned PyTorch's `max_pool_1d`, `max_pool_2d`, `max_pool_3d`, `avg_pool_1d`, `avg_pool_2d`, and `avg_pool_3d`, and distinguished them from the old pooling kernels aligned with TensorFlow. (https://github.com/Oneflow-Inc/oneflow/pull/8111)
`VectorAt` supports passing in non-const references: `JUST(VectorAt(vec, 1)) = 5;`. (https://github.com/Oneflow-Inc/oneflow/pull/8013)
Reduced the uncommon kernel template specializations of layer norm. (https://github.com/Oneflow-Inc/oneflow/pull/8209)
Modified the logic of `Tensor.numpy` to avoid extra memory growth when saving models. (https://github.com/Oneflow-Inc/oneflow/pull/8449)
Tensor str supports printing nd_sbp. (https://github.com/Oneflow-Inc/oneflow/pull/8458)
Slice supports SBP inference (S->P), and the semi-automatically deduced SBP can select the same SBP as expected in the reducible nd_sbp. (https://github.com/Oneflow-Inc/oneflow/pull/8536)
When printing a non-CPU, non-CUDA tensor, it is first copied to the CPU and then printed. (https://github.com/Oneflow-Inc/oneflow/pull/8548)
Refactoring and optimization: decoupling user kernel and device tag. (https://github.com/Oneflow-Inc/oneflow/pull/8529)
Refactoring and optimization: a series of kernels (`squeeze`, `reshape_like`, `flatten`, `expand_dims`, `reshape`, `amp_white_identity`, `identity`, `identity_buffer`, `parallel_cast`, `hierarchical_parallel_cast`, `hierarchical_parallel_cast_like`) were refactored into CopyDataContentKernel. (https://github.com/Oneflow-Inc/oneflow/pull/8537)
Refactoring and optimization: removed obsolete `constant_pad1d`, `constant_pad2d`, and `constant_pad3d` kernels. (https://github.com/Oneflow-Inc/oneflow/pull/8113)
Refactoring and optimization: removed the obsolete old lazy `upsample` kernel implementation. (https://github.com/Oneflow-Inc/oneflow/pull/8188)
Refactoring and optimization: removed obsolete message in shape proto and used sequential to represent stride. (https://github.com/Oneflow-Inc/oneflow/pull/8220)
Refactoring and optimization: removed the obsolete multiply kernel, which was included in `broadcast_mul`. (https://github.com/Oneflow-Inc/oneflow/pull/8359)
Refactoring and optimization: Renamed the shape in UserOp/Kernel to shape_view interface. (https://github.com/Oneflow-Inc/oneflow/pull/8433)
Refactoring and optimization: removed oneflow gemm. (https://github.com/Oneflow-Inc/oneflow/pull/8499)
Optimized the Maybe return type of such interfaces as Scalar.As(). (https://github.com/Oneflow-Inc/oneflow/pull/8348)
Code refactoring: refactored `ep::CpuDevice`. (https://github.com/Oneflow-Inc/oneflow/pull/7911)
Code refactoring: removed hard-coded special decision for device type like "cpu", "cuda" from system code. (https://github.com/Oneflow-Inc/oneflow/pull/8201)
Removed all dnn-related interfaces from the old version of KernelUtil (Primitive will be used to replace those interfaces). (https://github.com/Oneflow-Inc/oneflow/pull/8141)
Removed all interfaces related to mathematical calculation in the old version of KernelUtil (Primitive will be used to replace those interfaces). (https://github.com/Oneflow-Inc/oneflow/pull/8157)
Removed incomplete special decision for the "cuda" device type in scope util. (https://github.com/Oneflow-Inc/oneflow/pull/8173)
Achieved delayed capture of CUDA Graph. (https://github.com/Oneflow-Inc/oneflow/pull/8474)
Code refactoring: removed cuda_event. (https://github.com/Oneflow-Inc/oneflow/pull/8493)
Code refactoring: removed useless WITH_CUDA macro. (https://github.com/Oneflow-Inc/oneflow/pull/8562)
In 0.8.0, we have completed the ability of all kernels to deal with global tensor in distributed situation, and fixed many known bugs related to sbp. The global tensor worked efficiently and correctly at the kernel level. No matter how the distributed topology structure changes, the same algorithm logic can efficiently get mathematically consistent results, which greatly reduced the trouble of verifying correctness in the complex, diverse and asymmetric distributed parallel training process.
Completed unit tests for Primitives: `log_softmax`, `softmax`, `copynd`, `Memset`, `Memcpy`, `matmul`, `batch_matmul`, `add`, `fill`, and the binary/unary primitives, etc. (https://github.com/Oneflow-Inc/oneflow/pull/8132, https://github.com/Oneflow-Inc/oneflow/pull/8139, https://github.com/Oneflow-Inc/oneflow/pull/8137, https://github.com/Oneflow-Inc/oneflow/pull/8109, https://github.com/Oneflow-Inc/oneflow/pull/8143, https://github.com/Oneflow-Inc/oneflow/pull/8108, https://github.com/Oneflow-Inc/oneflow/pull/8154, https://github.com/Oneflow-Inc/oneflow/pull/8118 , https://github.com/Oneflow-Inc/oneflow/pull/8291)
Improved exception error handling
Added `reshape` exception handling. (https://github.com/Oneflow-Inc/oneflow/pull/7847)
Improved the error message of module when the input information does not match. (https://github.com/Oneflow-Inc/oneflow/pull/7918)
Added the `MAYBE_NEED_ERROR_MSG_CHECK` environment variable to check whether the CHECK functions of Maybe contain an oneflow::Error message, prompting developers to add error messages. (https://github.com/Oneflow-Inc/oneflow/pull/7955)
Improved the exception error message of the `gather` op. (https://github.com/Oneflow-Inc/oneflow/pull/7979)
Improved the `LayerNorm` error message. (https://github.com/Oneflow-Inc/oneflow/pull/8090)
Optimized the error message when Eager and Graph encounter multiple inconsistent input placement in op. (https://github.com/Oneflow-Inc/oneflow/pull/8054)
Improved the error message checking in activation-related kernel processing logic. (https://github.com/Oneflow-Inc/oneflow/pull/8080)
Improved the error messages in `tensor.to_global` and `tensor.to_local`. (https://github.com/Oneflow-Inc/oneflow/pull/8067)
Improved the exception error message in the `dot` kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8051)
Rewrote the exception check in the `batch_matmul` kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8186)
Fixed the problem of exception error checking when Python parses arg. (https://github.com/Oneflow-Inc/oneflow/pull/8205)
Improved the exception error checking logic of all array functor. (https://github.com/Oneflow-Inc/oneflow/pull/8116)
Improved the exception error checking logic of all binary functor. (https://github.com/Oneflow-Inc/oneflow/pull/8161)
Improved the exception error reporting logic in nn grad functor. (https://github.com/Oneflow-Inc/oneflow/pull/8210)
Added error message when Graph.build is not reloaded. (https://github.com/Oneflow-Inc/oneflow/pull/8250)
Added TypeError type and device-related error message. (https://github.com/Oneflow-Inc/oneflow/pull/8057)
Improved the error message of Eager SliceBoxing. (https://github.com/Oneflow-Inc/oneflow/pull/8232)
Improved the error message of broadcast op.
Improved the error message of Eager Boxing when it is at runtime. (https://github.com/Oneflow-Inc/oneflow/pull/7926)
Improved the error message of Tensor index. (https://github.com/Oneflow-Inc/oneflow/pull/8234)
Improved the error message in nn.functor. (https://github.com/Oneflow-Inc/oneflow/pull/7910)
Added check for Physical Shape when Graph compiles exec_graph. (https://github.com/Oneflow-Inc/oneflow/pull/8002)
Added default error message for CUDA check. (https://github.com/Oneflow-Inc/oneflow/pull/8427)
Added similar error checking information to the add_n calculation. (https://github.com/Oneflow-Inc/oneflow/pull/8495)
Improved the error message of arg sort. (https://github.com/Oneflow-Inc/oneflow/pull/8513)
Improved the error message of bias add. (https://github.com/Oneflow-Inc/oneflow/pull/8524)
Improved the error message in autograd function. (https://github.com/Oneflow-Inc/oneflow/pull/8496)
Improved the error message of batch gather. (https://github.com/Oneflow-Inc/oneflow/pull/8533)
Improved the error message prompt of defense code in autograd. (https://github.com/Oneflow-Inc/oneflow/pull/8525 , https://github.com/Oneflow-Inc/oneflow/pull/8541)
Supported CUDA 11.5 and 11.6. (https://github.com/Oneflow-Inc/oneflow/pull/7852 , https://github.com/Oneflow-Inc/oneflow/pull/8423)
Fixed the version of click at 8.0.0. (https://github.com/Oneflow-Inc/oneflow/pull/7967)
Updated nccl version to 2.12.10. (https://github.com/Oneflow-Inc/oneflow/pull/7822)
Aligned with PyTorch version 1.10.0 by default. (https://github.com/Oneflow-Inc/oneflow/pull/7019)
Updated tvm oneflow frontend dependencies. (https://github.com/Oneflow-Inc/oneflow/pull/8048)
Updated the version of LLVM/MLIR to support IREE. (https://github.com/Oneflow-Inc/oneflow/pull/8068 , https://github.com/Oneflow-Inc/oneflow/pull/8461)
Fixed the version of protobuf between 3.9.2 to 4.0. (https://github.com/Oneflow-Inc/oneflow/pull/8198)
Removed the cfg tool in cmake. (https://github.com/Oneflow-Inc/oneflow/pull/8218)
Enabled the CMAKE_INTERPROCEDURAL_OPTIMIZATION option by default. (https://github.com/Oneflow-Inc/oneflow/pull/8237)
Removed the XRT part in the OneFlow source code, and the OneFlow-XRT will be used as a third-party plugin for oneflow. (https://github.com/Oneflow-Inc/oneflow/pull/8273 ,https://github.com/Oneflow-Inc/oneflow/pull/8288)
Changed Liboneflow to dynamic library. (https://github.com/Oneflow-Inc/oneflow/pull/8312)
Updated the version of clang-tidy to 14.0.4. Supports the following syntax now: NOLINT, NOLINTNEXTLINE, NOLINTBEGIN & NOLINTEND. (https://github.com/Oneflow-Inc/oneflow/pull/8306)
Removed `EXTERNAL_INCLUDE_DIRS`; now builds only with targets. (https://github.com/Oneflow-Inc/oneflow/pull/8421)
Removed obsolete linkages in cmake. (https://github.com/Oneflow-Inc/oneflow/pull/8426)
Improved the running speed and stability of CI
Supported CI to automatically upload built docs. (https://github.com/Oneflow-Inc/oneflow/pull/7894 , https://github.com/Oneflow-Inc/oneflow/pull/7917)
Added CI test for IREE. (https://github.com/Oneflow-Inc/oneflow/pull/8419)
Printed the pip package in the container used to test in order to query version information easily. (https://github.com/Oneflow-Inc/oneflow/pull/7952)
Optimized the old version of SpeedTest. (https://github.com/Oneflow-Inc/oneflow/pull/7871 https://github.com/Oneflow-Inc/oneflow/pull/7990 https://github.com/Oneflow-Inc/oneflow/pull/8035)
Optimized the memory used by AutoTest. (https://github.com/Oneflow-Inc/oneflow/pull/7988)
Adjusted the threshold of benchmark. (https://github.com/Oneflow-Inc/oneflow/pull/8043)
Adjusted the timeout threshold. (https://github.com/Oneflow-Inc/oneflow/pull/8103)
Optimized the warning output related to `__del__` in CI. (https://github.com/Oneflow-Inc/oneflow/pull/8049)
Optimized the interval of gc to improve the test speed. (https://github.com/Oneflow-Inc/oneflow/pull/8138)
Optimized the use of super Tensor in CI unit tests to avoid slow gc dragging down the running speed of CI. (https://github.com/Oneflow-Inc/oneflow/pull/8177)
Optimized the number of CI build to improve the speed of build. (https://github.com/Oneflow-Inc/oneflow/pull/8229)
Optimized CI workflow: stop all workflows when a job fails. (https://github.com/Oneflow-Inc/oneflow/pull/8255)
Increased maximum parallelism 5 -> 10. (https://github.com/Oneflow-Inc/oneflow/pull/8259)
Enforced strict CI timeout-minutes. (https://github.com/Oneflow-Inc/oneflow/pull/8266)
Supported optional multi-machine testing via the `need-test-distributed` tag. (https://github.com/Oneflow-Inc/oneflow/pull/8372)
Tried to use a distributed test cache when testing on multiple machines. (https://github.com/Oneflow-Inc/oneflow/pull/8387/files)
Optimized the test time of global test. (https://github.com/Oneflow-Inc/oneflow/pull/8468)
Optimized the execution time of test_math_ops, test_loss, test_activation, test_tensor_part1, test_tensor_part2, and other eager tests. (https://github.com/Oneflow-Inc/oneflow/pull/8494)
Optimized test_convtranspose, test_einsum, test_sqrt_square_sum in expensive eager test. (https://github.com/Oneflow-Inc/oneflow/pull/8504)
Added the test of LiBai in CI. (https://github.com/Oneflow-Inc/oneflow/pull/7537, https://github.com/Oneflow-Inc/oneflow/pull/7929)
Fixed the speed test for Swin-Transformer. (https://github.com/Oneflow-Inc/oneflow/pull/7840)
Added the benchmark test for flow-vision.(https://github.com/Oneflow-Inc/oneflow/pull/7806, https://github.com/Oneflow-Inc/oneflow/pull/8024)
Added compatibility tests for `conv_mixer`, `densenet`, `ghostnet`, `googlenet`, `inception_v3`, `mnasnet`, `rexnet`, `rexnet_lite`, `res2net`, `shufflenet_v2`, `squeezenet`, `convnext`, `crossformer`, `efficientnet`, `levit`, `mlp_mixer`, `poolformer`, `pvt`, `res_mlp`, `uniformer`, `swin_transformer`, `senet`, and other models. Fixed compatibility issues such as: the conv2d module's padding parameter not supporting strings; the parameter list of functional.layer_norm not being aligned; meshgrid not supporting list[tensor] input. Also added a `tensor.reshape_as` interface. (https://github.com/Oneflow-Inc/oneflow/pull/7942)
Fixed the bug of Swin-Transformer dataloader. (https://github.com/Oneflow-Inc/oneflow/pull/8037)
Added single-node 4-Gpus tests for models such as InsightFace in oneflow_face repository. (https://github.com/Oneflow-Inc/oneflow/pull/8130)
Fixed the bug of nccl deadlock caused by CUDA kernel asynchronous launch limit for nccl logical kernel in 3-D parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/7924)
Fixed circular import of scope and session. (https://github.com/Oneflow-Inc/oneflow/pull/7993)
Used log_softmax + nll to make the sparse_softmax_cross_entropy computation subgraph more numerically stable. (https://github.com/Oneflow-Inc/oneflow/pull/7987)
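The log_softmax + nll decomposition can be sketched in plain Python (illustrative only, not the OneFlow kernel): computing log(softmax(x)) naively overflows in exp() for large logits, while the max-shifted logsumexp form stays finite.

```python
import math

def log_softmax(logits):
    # Numerically stable: subtract the max before exponentiating, so exp()
    # never overflows and log(sum(exp(...))) stays finite.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sparse_softmax_cross_entropy(logits, label):
    # nll on top of log_softmax: negate the log-probability of the label.
    return -log_softmax(logits)[label]

# A naive log(softmax(x)) would overflow here; the decomposed form is exact.
loss = sparse_softmax_cross_entropy([1000.0, 0.0, -1000.0], 0)
assert loss < 1e-6  # label 0 dominates, so the loss is essentially 0
```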
Fixed the bug that B2P boxing misses TaskEdge lbi. (https://github.com/Oneflow-Inc/oneflow/pull/8052)
Fixed the problem that compilation fails because an eager free tensor is not in nn.Graph's job. (https://github.com/Oneflow-Inc/oneflow/pull/8114)
Fixed the possible problem of SegmentFault caused by BlobDesc. (https://github.com/Oneflow-Inc/oneflow/pull/8252)
Solved the bug of circular import in python 3.6. (https://github.com/Oneflow-Inc/oneflow/pull/8268)
Solved the problem that Graph's input and parameter/buffer tensors fail to handle non-contiguous tensors.(https://github.com/Oneflow-Inc/oneflow/pull/8281)
Solved the potential deadlock caused by inconsistent partial order execution of multiple ranks in 3-D parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/8226)
Fixed the bug that Ibverbs failed to start the environment due to incorrect mtu value in special network environment. (https://github.com/Oneflow-Inc/oneflow/pull/8451)
Solved the potential deadlock caused by the partial order execution of each rank when the subsequent subgraph of GradAcc is inserted into the NCCL logical op; at the same time, traverse the subsequent subgraph of GradAcc more comprehensively to solve the problem of missing NCCL op. (https://github.com/Oneflow-Inc/oneflow/pull/8459)
Fixed the bug that NCCL logical kernels did not support the bool type. (https://github.com/Oneflow-Inc/oneflow/pull/8455)
Fixed the bug of tensor detach and clone in Graph. (https://github.com/Oneflow-Inc/oneflow/pull/8498)
Aligned the `DataLoader.__next__` interface (https://github.com/Oneflow-Inc/oneflow/pull/7835)
Fixed backtracking failure when calculating higher-order derivatives, caused by the capturing of forward detached tensors via AutoGrad
Fixed inadequate execution of the semantics of sync by Barrier Instruction (https://github.com/Oneflow-Inc/oneflow/pull/7702)
Fixed memory leak caused by imperfect management of VM instruction count
Fixed `getitem` when the tensor device id is not on the current rank
Fixed gradient calculation errors of global norm for various placements when calling clip grad in pipeline parallelism in eager global mode (https://github.com/Oneflow-Inc/oneflow/pull/7879)
Fixed possible int32 arithmetic overflow caused by `Shape.elem_cnt` (https://github.com/Oneflow-Inc/oneflow/pull/8178)
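Why `Shape.elem_cnt` can overflow int32 is easy to demonstrate in plain Python (a simulation of 32-bit wraparound, not OneFlow code): the element count of a modestly large tensor already exceeds 2**31 - 1, so index arithmetic kept in a 32-bit integer silently wraps.

```python
INT32_MAX = 2**31 - 1

def to_int32(x):
    # Simulate two's-complement int32 wraparound.
    return (x + 2**31) % 2**32 - 2**31

shape = (4, 1024, 1024, 1024)   # 2**32 elements in total
elem_cnt = 1
for d in shape:
    elem_cnt *= d

assert elem_cnt > INT32_MAX     # does not fit in int32
assert to_int32(elem_cnt) == 0  # 32-bit arithmetic silently wraps to 0
```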
Fixed incorrect results produced by `Module.to_global` when introducing parameters (https://github.com/Oneflow-Inc/oneflow/pull/8187)
Fixed extra GPU memory usage in `flow.load` and `module.load_state_dict` (https://github.com/Oneflow-Inc/oneflow/pull/8301)
Fixed extra GPU memory usage when Optimizer loads models (https://github.com/Oneflow-Inc/oneflow/pull/8310)
Fixed the error that occurs when loading models via `flow.load` on multiple nodes (https://github.com/Oneflow-Inc/oneflow/pull/8314)
Fixed instability of eager caused by the introduction of callback thread (https://github.com/Oneflow-Inc/oneflow/pull/8193)
Fixed the `tensor.from_numpy` interface to avoid memory leaks when the numpy input is non-contiguous (https://github.com/Oneflow-Inc/oneflow/pull/8391)
Fixed stack overflow when destructing the deep backward computational graph after recursion (https://github.com/Oneflow-Inc/oneflow/pull/8056)
Fixed global SBP inference of `unfold` (https://github.com/Oneflow-Inc/oneflow/pull/7883)
Fixed global SBP inference of `grid_sample` (https://github.com/Oneflow-Inc/oneflow/pull/7881)
Fixed incorrect pass of values in slice boxing kernel in certain cases (https://github.com/Oneflow-Inc/oneflow/pull/7893)
Fixed eager global inplace (https://github.com/Oneflow-Inc/oneflow/pull/7903)
Fixed SBP inference of the `upsample` op (https://github.com/Oneflow-Inc/oneflow/pull/7884)
Fixed SBP inference of `ScatterAdd`, `ScatterUpdate`, and `ScatterScalarUpdate` (https://github.com/Oneflow-Inc/oneflow/pull/7807)
Fixed backward memory error of `partial_fc` with Global Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8041)
Added support for S0 in `randperm` and fixed random ops producing equal local tensors across all ranks under Split (https://github.com/Oneflow-Inc/oneflow/pull/7571)
Fixed tensor getitem index error in global (https://github.com/Oneflow-Inc/oneflow/pull/8153)
Fixed SBP inference of `RoiAlign` and added a global unit test (https://github.com/Oneflow-Inc/oneflow/pull/7794)
Fixed SBP inference of the `stack` op (https://github.com/Oneflow-Inc/oneflow/pull/8181)
Fixed random initialization in median under CPU global (https://github.com/Oneflow-Inc/oneflow/pull/8245)
Fixed SBP inference of the `narrow` op and added global unit tests for `narrow` and `chunk` (https://github.com/Oneflow-Inc/oneflow/pull/7750)
Improved the legal SBP list of `batch_matmul` (https://github.com/Oneflow-Inc/oneflow/pull/8385)
Fixed NLLLoss’ failure to support model parallelism (https://github.com/Oneflow-Inc/oneflow/pull/8380)
Fixed S->S and S->P inference in Slice Op SBP infer (https://github.com/Oneflow-Inc/oneflow/pull/8521)
Fixed the bug that occurs when a Tensor dim is set to -1
Fixed the failure to directly convert a Tensor to int or float in Python (https://github.com/Oneflow-Inc/oneflow/pull/7927)
Fixed the bug in `Tensor.is_contiguous` that skipped initialization when caching and executed random initialization when getting values (https://github.com/Oneflow-Inc/oneflow/pull/7785)
Fixed the bug in Tensor slice view under 1d contiguous (https://github.com/Oneflow-Inc/oneflow/pull/7898)
Fixed incorrect processing of the None value by `Tensor.__eq__` (https://github.com/Oneflow-Inc/oneflow/pull/7938)
Fixed unaligned memory size in the `from_numpy` interface (https://github.com/Oneflow-Inc/oneflow/pull/7963)
Fixed incorrect initialization of the random seed in Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7904)
Fixed the failure of `oneflow.Size` to create a Tensor with a specified shape (https://github.com/Oneflow-Inc/oneflow/pull/8429)
Aligned the `alpha` parameter in `Tensor.add` (https://github.com/Oneflow-Inc/oneflow/pull/8140)
Fixed failure of `add` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7827)
Fixed failure of `reduce_sum` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7866)
Fixed failure of `one_hot` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7975)
Fixed failure of `gather` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8376)
Fixed a "memory access out of bounds" error in the `dim_scatter` kernel under Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8418)
Fixed failure of the start and end parameters of the `arange` op to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8522)
Fixed failure of `all` to support Scalar Tensor and 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8547)
Fixed failure of `conv` and `deconv` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8001)
Fixed failure of `cuda_check_numerics` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8050)
Fixed failure of `expand` and `advanced_index` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8094)
Fixed the bug that occurs when processing 0-Size Tensor in the `repeat_interleave` kernel and removed the relevant special judge in `gather` (https://github.com/Oneflow-Inc/oneflow/pull/8414)
Fixed failure of `diag` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8557)
Fixed sorting in the `nms` unit test (https://github.com/Oneflow-Inc/oneflow/pull/7831)
Fixed torch alignment of the beta and threshold interfaces of the `softplus` op (https://github.com/Oneflow-Inc/oneflow/pull/7888)
Fixed failure of `expand` to support passing tuples as parameters (https://github.com/Oneflow-Inc/oneflow/pull/7913)
Fixed computation failure in `randperm` when n is too large (https://github.com/Oneflow-Inc/oneflow/pull/7908)
Fixed failure of `meshgrid` to accept a list or tuple in parameter passing (https://github.com/Oneflow-Inc/oneflow/pull/7933)
Fixed the `nn.functional.conv2d` bug that all parameters must be specified (https://github.com/Oneflow-Inc/oneflow/pull/7892)
Fixed failure of `rand` and `randn` to support tuple as an input (https://github.com/Oneflow-Inc/oneflow/pull/7914)
Fixed the bug that occurs in `concat` when inputs are of inconsistent data types (https://github.com/Oneflow-Inc/oneflow/pull/7921)
Fixed the wrong device id got by the generator in certain cases in `randn`, `dropout`, `randint`, `rand`, `random_mask_like`, and `randperm` (https://github.com/Oneflow-Inc/oneflow/pull/7896)
Fixed inconsistent behaviors of `__shfl_sync` under `sm_61` in `layernorm` (https://github.com/Oneflow-Inc/oneflow/pull/7978)
Fixed failure of the `scatter` op to support negative dim (https://github.com/Oneflow-Inc/oneflow/pull/7934)
Fixed the bug in the `scatter` op nd update value (https://github.com/Oneflow-Inc/oneflow/pull/7953)
Fixed failure of `masked_select` to support certain Broadcast operations in eager mode (https://github.com/Oneflow-Inc/oneflow/pull/7984)
Fixed the bug in the `PReLU` op when dispatching num_blocks (https://github.com/Oneflow-Inc/oneflow/pull/8004)
Fixed misused numpy forced synchronization logic in `index_select` python and moved the logic into the functor implementation (https://github.com/Oneflow-Inc/oneflow/pull/7965)
Aligned the dtype parameter in `prod` (https://github.com/Oneflow-Inc/oneflow/pull/7932)
Fixed the bug that occurs when `ord = 0` in the `linalg.vector_norm` op; fixed the check on nan/inf by clip_grad (https://github.com/Oneflow-Inc/oneflow/pull/8007)
Fixed failure of `min` and `max` to operate on inconsistent dtypes (https://github.com/Oneflow-Inc/oneflow/pull/8021)
Added a `num_batches_tracked` buffer to `batch_norm` to facilitate transferring ResNet-18, a torch pretrained model, to OneFlow (https://github.com/Oneflow-Inc/oneflow/pull/7920)
Fixed the misuse of logf
, expf
, and powf
in math kernel (https://github.com/Oneflow-Inc/oneflow/pull/8038)
Fixed exclusion of dtype parameters in cumsum
and cumprod
and provided Tensor.cumsum
and Tensor.cumprod
methods (https://github.com/Oneflow-Inc/oneflow/pull/8065)
Fixed possible overflow when dtype is not int64 in non_zero op (https://github.com/Oneflow-Inc/oneflow/pull/7907)
Aligned sum, mean, all, any, and prod operations in reduce (https://github.com/Oneflow-Inc/oneflow/pull/8085)
Fixed incorrect backward computation in cumprod (https://github.com/Oneflow-Inc/oneflow/pull/8136)
Aligned the alpha parameter in the sub operation (https://github.com/Oneflow-Inc/oneflow/pull/8026)
Fixed shape inference in upsample op (https://github.com/Oneflow-Inc/oneflow/pull/8105)
Fixed failure of addn inplace operation on CPU tensors (https://github.com/Oneflow-Inc/oneflow/pull/8280)
Fixed the limit on tensor size in cum backward op based on the size of shared memory (https://github.com/Oneflow-Inc/oneflow/pull/8289)
Improved the logic of dtype inference for arange op (https://github.com/Oneflow-Inc/oneflow/pull/8338)
Fixed NaN propagation of UnaryFunctor (https://github.com/Oneflow-Inc/oneflow/pull/8346)
Fixed ndim check of pad (https://github.com/Oneflow-Inc/oneflow/pull/8354)
Fixed vector check in broadcast_min and broadcast_max backward computations (https://github.com/Oneflow-Inc/oneflow/pull/8379)
Fixed the bug in the index computation logic in cumprod op (https://github.com/Oneflow-Inc/oneflow/pull/8388)
Fixed possible int32 overflow in softmax and math unary/binary CUDA kernels; for kernels that perform integer division on i in CUDA_1D_KERNEL_LOOP, added an if statement to branch the computation, preventing performance loss in the common cases where int32 suffices (https://github.com/Oneflow-Inc/oneflow/pull/8472)
Fixed failure to pass size via size=(...) in random ops (normal, rand, randn, randint, and randperm) (https://github.com/Oneflow-Inc/oneflow/pull/8506)
Fixed error in cudaGetDeviceCount when the CUDA device count is 0 (https://github.com/Oneflow-Inc/oneflow/pull/8184)
Fixed possible unregistration of devices caused by the hob.ToString method; used static local variables to establish the dependency between the static variables of device registration and the static device-registration code (https://github.com/Oneflow-Inc/oneflow/pull/8235)
Fixed cudaErrorNoDevice caused by driver errors (https://github.com/Oneflow-Inc/oneflow/pull/8262)
Fixed memory leak caused by realpath (https://github.com/Oneflow-Inc/oneflow/pull/8540)
Introduced AutogradCapturedTensor in backward computation to avoid circular reference and allow correct backtracking to the input gradient node in higher order derivative graph (https://github.com/Oneflow-Inc/oneflow/pull/7808)
Added higher-order derivatives for the sin/cos ops; fixed autograd bugs related to higher-order derivatives (https://github.com/Oneflow-Inc/oneflow/pull/8163)
Fixed bugs in backward computation of concat and split_like to support higher-order derivatives (https://github.com/Oneflow-Inc/oneflow/pull/8208)
Fixed RTD [sphinx] failure to build docstr (https://github.com/Oneflow-Inc/oneflow/pull/7901)
Fixed compilation failure caused by opencv copy header failure (https://github.com/Oneflow-Inc/oneflow/pull/7944)
Fixed failure to generate a new .so in compilation when CMAKE_LINK_DEPENDS_NO_SHARED=YES (https://github.com/Oneflow-Inc/oneflow/pull/7868)
Fixed Eigen url in cmake third party (https://github.com/Oneflow-Inc/oneflow/pull/8223)
Fixed the bug caused by multi-time linking to libof_protoobj in XRT (https://github.com/Oneflow-Inc/oneflow/pull/8326)
Made libproto a dynamic library to avoid collision between static global variables (https://github.com/Oneflow-Inc/oneflow/pull/8345)
Made of_pyext_obj static only when there is one Python extension dynamic library that has Python symbols (https://github.com/Oneflow-Inc/oneflow/pull/8393)
Fixed the undefined symbol: del_curterm error in source code compilation (https://github.com/Oneflow-Inc/oneflow/issues/8398)
Fixed false positive warning in gcc11 compilation (https://github.com/Oneflow-Inc/oneflow/pull/8401)
Fixed SegFault that occurs when unzipping dataset in the container by making zlib a dynamic library (https://github.com/Oneflow-Inc/oneflow/pull/8481)
Fixed undefined reference of culibosTlsSetValue (https://github.com/Oneflow-Inc/oneflow/pull/8479)
Fixed stringop-truncation compilation error for gcc9 (https://github.com/Oneflow-Inc/oneflow/pull/8532)
Disabled static link of Simple CI and enabled debug build to avoid too many symbols (https://github.com/Oneflow-Inc/oneflow/pull/7940)
Fixed the bug in AutoTest fake program; Fixed print error in AutoTest (https://github.com/Oneflow-Inc/oneflow/pull/8279; https://github.com/Oneflow-Inc/oneflow/pull/8290)
Disabled conv3d test temporarily for its relatively large error of random values (https://github.com/Oneflow-Inc/oneflow/pull/7969)
Reduced test error in nn.LayerNorm (https://github.com/Oneflow-Inc/oneflow/pull/7941)
Optimized input data range of certain math op tests (https://github.com/Oneflow-Inc/oneflow/pull/8010)
Fixed incorrect unit test case in permute (https://github.com/Oneflow-Inc/oneflow/pull/8083)
Aligned error message of chunk to torch (https://github.com/Oneflow-Inc/oneflow/pull/8096)
Fixed incorrect use of permute in tensor tests (https://github.com/Oneflow-Inc/oneflow/pull/8144)
Fixed omission of test cases in instancenorm (https://github.com/Oneflow-Inc/oneflow/pull/8215)
Adjusted the unit test threshold for leaky_relu (https://github.com/Oneflow-Inc/oneflow/pull/8242)
Annotated cpu bn grad method that tests with random values (https://github.com/Oneflow-Inc/oneflow/pull/8257)
Skipped test cases of global argmax and median in multi-GPU scenarios (https://github.com/Oneflow-Inc/oneflow/pull/8264)
Adjusted the unit test threshold for fused_dot_feature_interaction (https://github.com/Oneflow-Inc/oneflow/pull/8293)
Disabled unit tests for conv_transpose1d, conv_transpose2d, and conv_transpose3d (https://github.com/Oneflow-Inc/oneflow/pull/8319)
Adjusted the tolerance setting in the embedding_renorm unit test (https://github.com/Oneflow-Inc/oneflow/pull/8394)
Removed test cases with excessive accumulated elements in test_fused_dot_feature_interaction_pooling_sum to avoid overly large sum error (https://github.com/Oneflow-Inc/oneflow/pull/8425)
Ensured that all PyTorch references in OneFlow API documentation belong to the same PyTorch version (1.10.0) (https://github.com/Oneflow-Inc/oneflow/pull/8058)
Added "copy" button for code in API docs to facilitate trial runs of sample code (https://github.com/Oneflow-Inc/oneflow/pull/7997)
Refined script that automatically generates version status for OneFlow APIs and fixed bugs in docs (https://github.com/Oneflow-Inc/oneflow/pull/8546)
Refined interface documentation of Tensor and Module (https://github.com/Oneflow-Inc/oneflow/pull/7823)
Refined Tensor.to_global interface documentation and added descriptions of grad_sbp
Refined Tensor.to_local interface documentation
Added Tensor Attributes docs for oneflow.placement, oneflow.env.all_device_placement, and oneflow.sbp.sbp
Added interface documentation for Module.to_consistent (outdated) and Module.to_global
Fixed invalid links in Tensor docs and updated consistent to global (https://github.com/Oneflow-Inc/oneflow/pull/7821)
Added docstr for Tensor.sqrt, Tensor.square, Tensor.addmm, Tensor.cosh, Tensor.diagonal, Tensor.log, Tensor.ndim, and Tensor.rsqrt (https://github.com/Oneflow-Inc/oneflow/pull/7841)
Enabled derived classes of pybind11 to add documentation for non-overriding methods and added interface documentation related to Tensor and autograd (https://github.com/Oneflow-Inc/oneflow/pull/7849)
Refined documentation of oneflow.argsort (https://github.com/Oneflow-Inc/oneflow/pull/7844)
Refined documentation of Tensor.zero_, Tensor.is_contiguous, Tensor.is_cuda, and the oneflow.nn.functional.layer_norm op (https://github.com/Oneflow-Inc/oneflow/pull/7839)
Refined interface documentation of support_sparse and step in oneflow.optim.Adamw and oneflow.optim.SGD (https://github.com/Oneflow-Inc/oneflow/pull/7848)
Refined interface documentation of LambdaLR.step, ReduceLROnPlateau.in_cooldown, and ReduceLROnPlateau.is_better (https://github.com/Oneflow-Inc/oneflow/pull/7848)
Refined interface documentation of nn.Module (https://github.com/Oneflow-Inc/oneflow/pull/8190)
Refined interface documentation of oneflow.optim.lr_scheduler.PolynomialLR (https://github.com/Oneflow-Inc/oneflow/pull/8430)
Refined docs and formula illustrations for oneflow.nn.CombinedMarginLoss (https://github.com/Oneflow-Inc/oneflow/pull/8206)
Refined documentation of oneflow.logical_and, oneflow.logical_or, oneflow.logical_xor, and oneflow.logical_not (https://github.com/Oneflow-Inc/oneflow/pull/8297)
Fixed the bug in the documentation of quantization ops (https://github.com/Oneflow-Inc/oneflow/pull/8333)
Updated the solution in Troubleshooting for the case when libunwind.h is not found (https://github.com/Oneflow-Inc/oneflow/pull/8336)
Restructured API documentation based on features; added and refined docs of features that are unique to OneFlow (https://github.com/Oneflow-Inc/oneflow/pull/8392)
Published by jackalcooper over 2 years ago
OneFlow v0.7.0 came out. Welcome to use it. We would love to hear your feedback!
https://mp.weixin.qq.com/s/dSR-2Xw92eoFhF0c6MtutQ
This release has the following highlights:
Provides a Tensor that can be executed in multi-nodes multi-GPUs scenarios: Global Tensor. It is an easy-to-use solution for distributed execution. It makes it easier to implement various distributed parallel strategies and enables more flexible and user-friendly distributed implementation. It supports models including ResNet50, Wide and Deep, GPT, Bert, Swin-Transformer, InsightFace, etc.
Continues to improve nn.Graph. Supports the advanced features such as ZeRO, GradAcc, Checkpointing, and Pipelining, and enriches the graph.debug mode. Supports random 2D SBP conversion, semi-automatic derivation of 2D SBP, resuming training from the last checkpoint, etc. Adds OneFlow Feature Stages Identifications and identifies each feature of nn.Graph. For nn.Graph, its basic features are at the Beta Stage, which can meet most of the requirements of users; Advanced features are at Alpha Stage, meeting standard requirements.
Deeply optimizes the performance of Eager mode. The performance of the Swin-Transformer model is 3 times higher than that of v0.6.0 when tested on the V100.
Operators-related improvements: In the single-node single-GPU scenario, OneFlow's compatibility with PyTorch is further improved. The interfaces, semantics, and produced results of operators supported by OneFlow are consistent with those of operators supported by PyTorch, and an automatic testing framework is designed to verify the consistency. With common models, you can accomplish the migration by running import oneflow as torch. Compared with v0.6.0, OneFlow adds 16 operators, optimizes the performance of 6 operators, and fixes bugs in 16 operators.
Supports Einsum and View mechanism.
Compiler-related improvements: OneFlow is officially connected to the MLIR ecosystem.
Releases OneFlow-Serving v0.1.0: We provide an out-of-the-box Triton OneFlow backend docker image. try here.
Releases LiBai v0.1.0, a toolbox for massively distributed parallel training of Transformer. Compared with customized code bases such as Megatron-LM, LiBai provides a series of models and training components for distributed training based on a modular design, aiming to make models trained in distributed mode as convenient as in single-GPU mode.
Releases Flow-Vision v0.1.0: adds DeiT, ConvNeXt, ReXNet, and other models and updates tutorials and documentation.
OneFlow Feature Stages identifies the maturity level of OneFlow features. It provides users with a status description of a feature to inform the specific level of it, such as completeness, API stability, documentation, etc. It Provides OneFlow developers with a standard for feature refinement, which facilitates further improvement.
OneFlow Feature Stages
Stable Stage
Release Candidate (RC) Stage
Beta Stage
Alpha Stage
Pre-alpha Stage
Global Tensor is a newly released set of distributed computing interfaces. It can easily support any parallelism including data parallelism, model parallelism, and pipeline parallelism. Unlike a normal Tensor (hereafter called Local Tensor), Global Tensor is a Tensor with a global view, whose data is distributed in a specific way across a set of devices in a cluster, and each node stores some or all of the Global Tensor's data. Placement and SBP are the basic properties of the Global Tensor that describe the distribution of the data in clusters.
Global Tensor supports three different ways of data distribution, which we collectively refer to as SBP.
With split(dim), the data is split along the dim dimension and distributed to each device; with broadcast, every device holds a full copy; with partial_sum, the logical tensor is the elementwise sum of the per-device parts.
Global Tensor has basically the same computational interfaces as Local Tensor. Only with small changes, you can convert the single-GPU mode to the distributed mode.
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0])
>>> y = x * x
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0],
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(0))
>>> y = x * x
# This multiplication is performed on both rank 0 and rank 1
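The three SBP distribution types can be sketched in plain Python (this is a conceptual illustration only, not the OneFlow API; the variable names are hypothetical):

```python
# Logical (global) tensor distributed over two ranks:
data = [1.0, 2.0, 3.0, 4.0]

# split(0): each rank holds a contiguous slice along dim 0
split_parts = {0: data[:2], 1: data[2:]}

# broadcast: every rank holds the full tensor
broadcast_parts = {0: list(data), 1: list(data)}

# partial_sum: the logical tensor is the elementwise sum of per-rank parts
partial_parts = {0: [0.5, 1.0, 1.5, 2.0], 1: [0.5, 1.0, 1.5, 2.0]}
recovered = [a + b for a, b in zip(partial_parts[0], partial_parts[1])]
```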
With Tensor.to_global interface, you can create a Global Tensor based on a Local Tensor, and regard this tensor as the local tensor of the Global Tensor on the present device.
With Tensor.to_local interface, you can return the local tensor of the Global Tensor on the present device.
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0],
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(0))
>>> y = x.to_local()
>>> y.size()
oneflow.Size([1])
>>> y
tensor([1.], device='cuda:0', dtype=oneflow.float32)
# tensor([2.], device='cuda:0', dtype=oneflow.float32) if rank is 1
With the Tensor.to_global interface, you can redistribute the data of a Global Tensor across clusters. The data can be distributed to another set of nodes, and the way of distribution within that set of nodes can also be changed (i.e., the SBP can be changed). Redistribution usually generates inter-process data communication, but the Tensor.to_global interface hides the complicated low-level communication details.
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0], placement=flow.placement("cuda", ranks=[0, 1]), sbp=flow.sbp.split(0))
>>> y = x.to_global(placement=flow.placement("cuda", ranks=[2, 3]), sbp=flow.sbp.broadcast)
Each operator of OneFlow defines a set of SBP signatures for the input and output tensor. Global Tensor supports automatic redistribution to provide the required SBP signature of a certain interface. Just as the code shown below:
>>> import oneflow as flow
>>> x = flow.randn(4, 4,
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(0))
>>> y = flow.randn(4, 4,
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(1))
>>> z = x + y
When x + y is executed, since x is split along dimension 0 while y is split along dimension 1, their local tensors at each device cannot be added up directly. Therefore, x's SBP will be automatically converted to flow.sbp.split(1), or y's SBP will be converted to flow.sbp.split(0); the SBP of the result z is flow.sbp.split(1) or flow.sbp.split(0) accordingly.
Global Tensor doesn't support mix-in with DDP interface currently.
Global Tensor requires all devices to execute simultaneously, and the code that has branches would lead to process deadlock because of divergent execution paths. We will continue fixing this problem.
Fundamental features enter into Beta Stage, meeting most requirements of users;
Advanced features enter into Alpha Stage, meeting standard requirements of users;
ResNet50, Wide and Deep, GPT, Bert, Swin-Transformer, InsightFace, and other models are supported;
Static and dynamic casting of operators under Static Graph enter into Beta Stage from Alpha Stage
Adds the unit test of static execution for all legal operators under nn.Graph, and automated unit test is ready;
Supports more flexible inputs and outputs, including List/Tuple/Dict and their nesting, and fixes the Tuple problem of producing a return size of "1";
Adds backward automatic test;
Optimizer and LR Scheduler under Static Graph enter into Beta Stage from Alpha Stage.
Adds more built-in LR schedulers, including WarmupLR, CosineAnnealingWarmRestarts and other common schedulers, and provides SequentialLR and ChainedScheduler to enable scheduler with different combination capacity;
Refactors the scheduler's get_lr function into a pure function. This change permits using schedulers in combination by changing the LR calculation from an iterative solution to an analytical solution;
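To illustrate what "analytical solution" means here, the following is a minimal pure-Python sketch (hypothetical names, not OneFlow's actual implementation) of a cosine-annealing LR computed as a pure function of the current step, which is what makes schedulers composable:

```python
import math

def cosine_annealing_lr(base_lr, step, t_max, eta_min=0.0):
    """Analytical LR: depends only on the current step, not on the
    previously computed LR, so schedulers can be chained freely."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2
```

Because the value at any step is computed directly, a SequentialLR-style wrapper can query any constituent scheduler at an arbitrary step without replaying the whole history.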
Adds the "is_sparse" parameter for the add_optimizer interface, supporting sparse updates under graph mode. Optimizers that support sparse updates include Adam and SGD, while optimizers under Eager mode don't support sparse updates yet. A subsequent version will support both sparse updates and sparse tensors. The feature is at Pre-alpha Stage;
Adds a debug print feature for LR and Step; you only need to turn on the LR Scheduler's verbose option.
state_dict and load_state_dict under Static Graph are newly added, which allow resuming training from the last checkpoint. The feature is at Beta Stage;
Debug under Static Graph enters into Beta Stage from Alpha Stage;
Adds debug(2) and debug(3), which allow finding problems in nn.Module by locating the Python code of operators at the C++ layer and by locating forward graph creation and inference for operators;
Adds the display of memory overhead
ZeRO-DP under Static Graph is newly added, which allows reducing the memory overhead related to the Optimizer under data parallelism. The feature is at Alpha Stage;
Global Tensor under Static Graph supports multiple parallel methods, and the feature is between Alpha Stage and Beta Stage;
It is utilized in LiBai and other model libraries;
It is widely utilized in OneFlow's model libraries, and the coverage of unit test is still ongoing;
For 1D Global Tensor, you can define only the input tensor's SBP, while the output tensor's SBP can be derived automatically with good results. The feature is at Beta Stage;
For 2D Global Tensor, you can define only the input tensor's SBP, while the output tensor's SBP can be derived automatically with good results. The feature is at Alpha Stage;
Conversion from 1D to ND or ND to 1D is newly supported, and the feature is at Alpha Stage;
Random conversion of 2D SBP is newly supported, and the feature is at Alpha Stage;
Testing of 1D&2D single operator is still ongoing, and the feature is at Pre-alpha Stage;
Selecting SBP with semi-automatic derivation is supported, and the feature is at Pre-alpha Stage;
For Gradient Accumulation under Static Graph, we refactored and repaired support for Reshape and added API documentation. In place of the mini-batch input interface, a future version will offer micro-batch input with a better experience. The feature moves from Pre-alpha to Alpha Stage;
For pipeline parallelism under Static Graph, the tutorial is perfected, and pipeline parallelism is available in LiBai and other model libraries. The feature is at Beta Stage;
For automatic mixed precision (AMP) under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage;
For Activation Checkpointing under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage;
For Op Fuse optimization under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage;
For XLA/TensorRT/OpenVINO execution under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage;
Tutorials
API Documentation
Tutorials of pipeline parallelism:
The performance of Eager is deeply optimized. When running the Swin-Transformer model on V100 GPUs, OneFlow delivers a 25% speedup over PyTorch on a single GPU and a 10% speedup on 8 GPUs;
The communication scheduling policy for NCCL in DDP is optimized;
DDP supports the optimization of AllReduce fuse, reducing additional overhead generated by fragmented AllReduce, with a 5% performance speedup when it is tested on ResNet50;
VM supports the optimization of instruction fusion, significantly saving scheduling overhead of Kernel;
Additional memory overhead is optimized when CPU overload is too high;
Eager DataLoader supports the optimization of inter-process memory sharing;
The performance of Clip Grad is optimized;
The performance of CPU operators such as unary and binary element-wise is improved by 4 times, and the speed of Swin-Transformer's dataloader is improved by 2.5 times. https://github.com/Oneflow-Inc/oneflow/pull/7319
Adds the functionality of inter-process shared memory to Dataloader, which greatly improves the performance of DataLoader in DDP.
Adds Bool type Tensor. https://github.com/Oneflow-Inc/oneflow/pull/7523
Implements to_contiguous, which the view mechanism relies on. https://github.com/Oneflow-Inc/oneflow/pull/7670
Adds Scalar div operators. https://github.com/Oneflow-Inc/oneflow/pull/7483
Adds Lamb optimizer. https://github.com/Oneflow-Inc/oneflow/pull/7389
Adds Polynomial Learning Rate Scheduler. https://github.com/Oneflow-Inc/oneflow/pull/7260
Adds tensor_split and as_strided operators. https://github.com/Oneflow-Inc/oneflow/pull/7258 & https://github.com/Oneflow-Inc/oneflow/pull/7275
Adds cumprod operators. https://github.com/Oneflow-Inc/oneflow/pull/7278
Adds Tensor.T() and oneflow.t() operators. https://github.com/Oneflow-Inc/oneflow/pull/7269
Adds normalize operators. https://github.com/Oneflow-Inc/oneflow/pull/7113
Adds the inplace version of div and sub operators. https://github.com/Oneflow-Inc/oneflow/pull/7293
Adds the feature of Module.zero_grad. https://github.com/Oneflow-Inc/oneflow/pull/7587/
Adds the feature of Scalar Tensor being the index to do list indexing. https://github.com/Oneflow-Inc/oneflow/pull/7597
Adds support for Leaky ReLU operators half type. https://github.com/Oneflow-Inc/oneflow/pull/7569
Adds support for mask select operators. https://github.com/Oneflow-Inc/oneflow/pull/7492
Adds non-reduce communication operations such as Bool type Broadcast and Allgather. https://github.com/Oneflow-Inc/oneflow/pull/7366
Develops autotest that supports eager global based on an autotest framework. https://github.com/Oneflow-Inc/oneflow/pull/7204
Optimizes performance for ReduceSum CUDA Kernel. https://github.com/Oneflow-Inc/oneflow/pull/7684
Optimizes CUDA Kernel of gather operators. https://github.com/Oneflow-Inc/oneflow/pull/7351
Optimizes the performance for CUDA Kernel of MaxPool and AvgPool operators in NCHW. https://github.com/Oneflow-Inc/oneflow/pull/7426 & https://github.com/Oneflow-Inc/oneflow/pull/7451
Optimizes the backward computing of PReLU operators, which can save more memory in general. https://github.com/Oneflow-Inc/oneflow/pull/7600
Optimizes backward Kernel of LayerNorm to further save memory. https://github.com/Oneflow-Inc/oneflow/pull/6996
Supports passing a single int for stride and dilation in Conv1D/2D/3D and DeConv1D/2D/3D kernels. Adds the Tensor.zero_() interface and aligns tensor.norm, torch.max, and torch.min with PyTorch.
Supports inplace in flow.nn.functional.dropout. https://github.com/Oneflow-Inc/oneflow/pull/7593
Fixes bug where the BatchNorm module raises an error when affine=False. https://github.com/Oneflow-Inc/oneflow/pull/7755
Fixes Maximum and Minimum backward bug. https://github.com/Oneflow-Inc/oneflow/pull/7519
Fixes bug where the result of var operators is unexpected in some cases. https://github.com/Oneflow-Inc/oneflow/pull/7517
Fixes incorrect behavior of Tensor deepcopy bug. https://github.com/Oneflow-Inc/oneflow/pull/7490
Fixes bug where input index is scalar tensor in slice operators. https://github.com/Oneflow-Inc/oneflow/pull/7479
Fixes bug where BinaryCrossEntropy can produce nan in half. https://github.com/Oneflow-Inc/oneflow/pull/7476
Fixes bug where an error is raised when the base and exponent of pow operators are respectively real number type and Tensor type. https://github.com/Oneflow-Inc/oneflow/pull/7729
Fixes stack operators backward bug. https://github.com/Oneflow-Inc/oneflow/pull/7363
Fixes inefficiency problem caused by CPU synchronization when clip grad is executed on CUDA with the default configuration. https://github.com/Oneflow-Inc/oneflow/pull/7304
Fixes the SBP inference of Batch Gather and Unsorted Batch Segment Sum operators, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7590
Fixes Physical Shape inference of Affine Grid operators, fixes the unexpected result bug in some SBP cases, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7578
Fixes the problem that arange operators don't support generating 0 size tensor, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7576
Fixes the incorrect SBP inference of flip operators, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7496
Fixes advanced indexing and zeroslike operators SBP bugs. https://github.com/Oneflow-Inc/oneflow/pull/7238
Fixes bug where Eager global inplace might not be successful. https://github.com/Oneflow-Inc/oneflow/pull/7348
Adds einsum operators. einsum provides a set of concise but elegant rules, which can implement tensor operations including but not limited to: inner product, outer product, tensor multiplication, tensor transposition, tensor contraction, etc. Proficient use of einsum allows you to easily implement various complex tensor operations and be less error-prone. https://github.com/Oneflow-Inc/oneflow/pull/7526
Adds the view mechanism. The view mechanism allows common operators to reuse/share a Tensor's memory, saving memory by reducing the Kernel Launch/Compute process. At present, new view operators that do not change the tensor.is_contiguous() property have been added, such as reshape, view, squeeze, unsqueeze, etc.: https://github.com/Oneflow-Inc/oneflow/pull/7503 More view operators will be added later (such as transpose, permute, narrow, expand, and unfold).
OneFlow is officially connected to the MLIR ecosystem, and the OneFlow Dialect component is complete. Successfully completes OneFlow Job (computation graph of OneFlow nn.Graph) and RoundTrip of MLIR, and runs RoundTrip tests on all operators of OneFlow in CI process.
Implements static graph optimization with a series of automatic fused operators based on MLIR DRR to accelerate OneFlow model training and inference.
OneFlow Serving v0.1.0 comes out with the following features:
Provides OneFlow C++ API used for inference, supporting model loading and static graph inference.
The model weights and the computation graph in MLIR format can be saved simultaneously by running flow.save(graph) in Python. They can be loaded via the C++ API (loading the computation graph is not supported in the Python API at present).
Supports inference of OneFlow model using TensorRT and OpenVINO automatically without model conversion (based on OneFlow XRT module), achieving better acceleration on NVIDIA GPU and Intel CPU.
Implements Triton OneFlow backend
Welcome to use the project deployed with Triton OneFlow backend launched on OneFlow Cloud Platform.
LiBai is a toolbox for massively distributed parallel training of Transformer. Compared with custom code bases such as Megatron-LM, LiBai provides a series of models and training components for distributed training based on a modular design, aiming to make models trained in distributed mode as convenient as in single-GPU mode. The 0.1.0 version mainly supports the following features and models:
Features:
Features: Trainer and Evaluator
Models: Bert (3D Parallelism), GPT-2 (3D Parallelism), ViT (3D Parallelism), Swin-Transformer (Data Parallelism), plus tasks under projects/
flowvision 0.1.0 stable version comes out with the following improvements based on the previous version:
Added trunc_normal_ method
Added DeiT model; rebuilt the VisionTransformer model
Added ConvNeXt model
Added ReXNet model
Added PolyLRScheduler and TanhLRScheduler
Added F.normalize in SSD model
Added EfficientNet and Res2Net
Added vit_small_patch32_384 and res2net50_48w_2s models
Refined model zoo and ran more complete tests on existing models
Refined load_state_dict_from_url method to automatically save the downloaded weights in the cache folder
Refined Getting Started and flowvision.models documentation
Published by jackalcooper almost 3 years ago
OneFlow has been open sourced for 528 days since July 31, 2020. Today OneFlow v0.6.0 came out. Welcome to use OneFlow v0.6.0. We would love to hear feedback!
This version mainly updates three parts: framework, model, and OneFlow-ONNX. Highlights include:
The following are the detailed release notes.
import oneflow as torch
Users can customize autograd.Function just like using Torch.
Serving functionality of models is provided by OneFlow as Nvidia Triton's backend.
ResNet, DenseNet, VGG, ResNext, EfficientNet, etc.
ViT, PVT, Swin-Transformer, etc.
Mlp-Mixer, Res-MLP, g-MLP, etc.
sketch, candy, mosaic, rain_princess, and undie
For data augmentation operations like CenterCrop and ColorJitter similar to torchvision, developers can run import flowvision as torchvision to execute in most scenarios.
Advanced data augmentation operations implemented in flowvision.data:
Non-Local, SELayer, CBAM, BAM, ECA, etc.
PatchEmb, Pooler, ConvBnAct, etc.
drop-path, drop-block, and stochastic depth to improve model generalization ability
activation and weight_init to improve components like activation function and initialization method
Run pip install oneflow-onnx to experience it.
Published by jackalcooper about 3 years ago
oneflow.compatible.single_client
OneFlow can serve as a drop-in replacement of import torch for existing PyTorch projects. You could test it by interchanging import oneflow as torch and import torch as flow.
Here is a minimum example showcasing how to incorporate a nn.Module in a nn.Graph and have it run in lazy mode.
class NeuralGraph(flow.nn.Graph):
def __init__(self, ...):
super().__init__()
self.model = model # model is a nn.Module instance
def build(self, x):
y_pred = self.model(x)
return y_pred
graph = NeuralGraph() # to create a nn.Graph instance
y_pred = graph(x) # to run the created nn.Graph
-DTREAT_WARNINGS_AS_ERRORS=OFF #6008
Renamed *parallel_distribution* to *nd_sbp* (1) #5815
unpack_call_dispatcher for better performance #5820
JUST_MSG and CHECK_JUST_MSG #5904
raise RuntimeError #5890
Renamed the ParallelDistribution class to NdSbp #5814
nn.AdaptiveAvgPool1d and nn.AdaptiveAvgPool3d #5445
Published by jackalcooper about 3 years ago
oneflow.compatible.single_client
import torch
for existing Pytorch projects. You could test it by inter-changing import oneflow as torch
and import torch as flow
.Here is a minimum example showcasing how to incorporate a nn.Module
in a nn.Graph
and have it run in lazy mode.
class NeuralGraph(flow.nn.Graph):
def __init__(self, ...):
super().__init__()
self.model = model # model is a nn.Module instance
def build(self, x):
y_pred = self.model(x)
return y_pred
graph = NeuralGraph() # to create a nn.Graph instance
y_pred = graph(x) # to run the created nn.Graph
-DTREAT_WARNINGS_AS_ERRORS=OFF
#6008
*parallel_distribution*
to *nd_sbp*
(1) #5815
unpack_call_dispatcher
for better performance #5820
JUST_MSG
and CHECK_JUST_MSG
#5904
raise RuntimeError
#5890
ParallelDistribution
class to NdSbp
#5814
nn.AdaptiveAvgPool1d
and nn.AdaptiveAvgPool3d
#5445
Published by jackalcooper over 3 years ago
In this release we added a large number of new features to OneFlow; 0.4.0 is the biggest update since OneFlow was open-sourced. It adds 2-D SBP, pipeline parallelism, a new Checkpoint interface, a large number of PyTorch-aligned interfaces, and CUDA 11.2 support. We have previously open-sourced the OneFlow GPT code, which makes heavy use of the new features in this release; you are also welcome to read the article "OneFlow — enabling every algorithm engineer to train GPT".
```python
# Cast to a 2-D SBP distribution on a (2, 2) device hierarchy
with flow.scope.placement("gpu", "0:0-3", (2, 2)):
    x = flow.hierarchical_parallel_cast(
        x, parallel_distribution=["B", "S(1)"]
    )

# Cast back to a 1-D distribution on a flat (4,) hierarchy
with flow.scope.placement("gpu", "0:0-3", (4,)):
    x = flow.hierarchical_parallel_cast(
        x, parallel_distribution=["S(0)"]
    )
```
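As an illustration of what the 2-D SBP signature `["B", "S(1)"]` above describes, here is a conceptual sketch in plain Python (not OneFlow code): on a (2, 2) device hierarchy, the tensor is broadcast along hierarchy axis 0 and split along its dim 1 across hierarchy axis 1.

```python
# Plain-Python sketch of a 2-D SBP layout; x plays the role of a 4x4 tensor.
x = [[r * 4 + c for c in range(4)] for r in range(4)]

def split_dim1(t, parts, idx):
    """Return the idx-th slice of t split evenly along dim 1."""
    w = len(t[0]) // parts
    return [row[idx * w:(idx + 1) * w] for row in t]

devices = {}
for i in range(2):        # hierarchy axis 0: "B" (broadcast, full copy)
    for j in range(2):    # hierarchy axis 1: "S(1)" (split tensor dim 1)
        devices[(i, j)] = split_dim1(x, 2, j)
```

Devices in the same column of the hierarchy hold identical data ("B"), while devices in the same row hold different halves of dim 1 ("S(1)").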
Scope for setting the `pipeline_stage` of a layer:

```python
with flow.experimental.scope.config(
    pipeline_stage_id_hint=dist_util.get_layer_stage(layer_idx)
):
    ...
```
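A hypothetical implementation of a stage-assignment helper like `dist_util.get_layer_stage` used above (the even-block policy here is an assumption for illustration, not OneFlow's actual code):

```python
def get_layer_stage(layer_idx, num_layers=24, num_stages=4):
    # Hypothetical policy: map layer indices onto pipeline stages in
    # contiguous, even blocks, e.g. 24 layers on 4 stages puts layers
    # 0-5 on stage 0, 6-11 on stage 1, and so on.
    layers_per_stage = (num_layers + num_stages - 1) // num_stages
    return layer_idx // layers_per_stage
```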
Gradient accumulation is configured through `FunctionConfig`:

```python
func_cfg = flow.FunctionConfig()
...
func_cfg.train.num_gradient_accumulation_steps(args.num_accumulation_steps)

@flow.global_function(..., function_config=func_cfg)
```
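Conceptually, gradient accumulation runs several micro-batches, sums their gradients, and applies a single optimizer update, so the effective batch size grows without extra activation memory. A plain-Python sketch (not OneFlow internals; `grad_fn` and `apply_update` are hypothetical stand-ins):

```python
def accumulated_step(micro_batches, grad_fn, apply_update):
    # Run one micro-batch at a time, summing gradients instead of
    # updating, then perform exactly one parameter update at the end.
    acc = None
    for mb in micro_batches:
        g = grad_fn(mb)
        acc = g if acc is None else acc + g
    apply_update(acc / len(micro_batches))  # average, then update once
```

For example, with `grad_fn=lambda mb: 2 * mb` over micro-batches `[1.0, 2.0, 3.0]`, a single update with the averaged gradient `4.0` is applied.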
Optimizer placement optimization (ZeRO) is also configured through `FunctionConfig`:

```python
func_cfg = flow.FunctionConfig()
...
func_cfg.optimizer_placement_optimization_mode(mode)  # mode = "non_distributed" or "distributed_split"

@flow.global_function(..., function_config=func_cfg)
```

`mode = "distributed_split"` corresponds to stage 2 of the DeepSpeed ZeRO optimization.

Activation checkpointing is enabled via a scope config:

```python
with flow.experimental.scope.config(
    checkpointing=True
):
```

For details, see the article "Sublinear memory optimization: the implementation of activation checkpointing in OneFlow".
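The idea behind `checkpointing=True`, sketched in plain Python (a conceptual sketch, not OneFlow's implementation): the forward pass of a checkpointed segment stores only the segment's input, and intermediate activations are recomputed during backward.

```python
def checkpointed_forward(layers, x):
    # Store only the segment input instead of every intermediate
    # activation, trading recomputation time for memory.
    segment_input = x
    for f in layers:
        x = f(x)
    return x, segment_input

def recompute_activations(layers, segment_input):
    # During backward, rerun the forward pass to regenerate the
    # activations needed for gradient computation.
    acts, x = [], segment_input
    for f in layers:
        x = f(x)
        acts.append(x)
    return acts
```

For a two-layer segment `[lambda v: v + 1, lambda v: v * 2]` applied to `3`, only the input `3` is saved, and the activations `[4, 8]` are regenerated on demand.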
Added the `oneflow.experimental` namespace, partially aligned with the `torch.xxx` interfaces. Usage of the new interfaces:

```python
import oneflow.experimental as flow
flow.enable_eager_execution()  # enable eager mode
```
Features currently aligned (in part):
flow.nn.Conv2d <-> torch.nn.Conv2d
flow.nn.BatchNorm2d <-> torch.nn.BatchNorm2d
flow.nn.ReLU <-> torch.nn.ReLU
flow.nn.MaxPool2d <-> torch.nn.MaxPool2d
flow.nn.AvgPool2d <-> torch.nn.AvgPool2d
flow.nn.Linear <-> torch.nn.Linear
flow.nn.CrossEntropyLoss <-> torch.nn.CrossEntropyLoss
flow.nn.Sequential <-> torch.nn.Sequential
flow.nn.Module.to <-> torch.nn.Module.to
flow.nn.Module.state_dict <-> torch.nn.Module.state_dict
flow.nn.Module.load_state_dict <-> torch.nn.Module.load_state_dict
flow.save <-> torch.save
flow.load <-> torch.load
flow.Tensor <-> torch.Tensor
flow.tensor <-> torch.tensor
flow.tensor.to <-> torch.tensor.to
flow.tensor.numpy <-> torch.tensor.numpy
flow.tensor arithmetic (+, -, *, /) <-> torch.tensor arithmetic (+, -, *, /)
flow.tensor.flatten <-> torch.tensor.flatten
flow.tensor.softmax <-> torch.tensor.softmax
flow.optim.SGD <-> torch.optim.SGD
With the modules above you can already easily build common networks such as ResNet, BERT, and MobileNetV3. Later releases will align/support more interfaces, at which point most networks built on PyTorch can be easily switched to OneFlow.
Quick-start LeNet example: https://github.com/Oneflow-Inc/models/blob/main/quick_start_demo_lenet/lenet.py
New interface documentation: https://oneflow.readthedocs.io/en/master/experimental.html
ResNet50 example aligned with torchvision: https://github.com/Oneflow-Inc/models/tree/main/resnet50
The next few releases will add more PyTorch-aligned interfaces. Interfaces aligned under `experimental` will be moved into the `oneflow` namespace in the 0.6.0 release, at which point they will be fully aligned with PyTorch, and OneFlow 0.6.0 will make eager the default execution mode. Eager mode currently supports single-GPU execution only; multi-GPU support will arrive in 0.5.0.
Previously, OneFlow releases used a "different package name, same version name" scheme, e.g. oneflow_cu102==0.3.4; starting from 0.4.0 the scheme is "same package name, different version name", e.g. oneflow==0.4.0+cu102. For the latest installation instructions, see the "Install with Pip Package" section of the README.
Both the stable and nightly versions of OneFlow now support the CUDA 11.2 platform (cu112).
The ONNX module is now maintained in the new repository https://github.com/Oneflow-Inc/oneflow_convert_tools, and the ONNX-related code in the main OneFlow repository will be removed in the next release; for details, see the article "How does the deep learning framework OneFlow interact with ONNX?". oneflow_convert_tools currently targets OneFlow's lazy mode, and its latest version is v0.3.2; versions targeting eager mode will be numbered starting from 0.4.0.
The next OneFlow release will include more comprehensive PyTorch compatibility, with richer interface support and multi-GPU support, and will also support conversion between dynamic and static graphs. Stay tuned!
Published by jackalcooper almost 4 years ago
- Move `swish` and `mish` from the `math` namespace to `nn` #4104
- Make `MaxWithLogThreshold` and `SafeLog` header only #4030
- `pack_size` in GenericLauncher #4014
Published by jackalcooper almost 4 years ago
Published by jackalcooper about 4 years ago
- `mean_square` and add unit tests for optimizers #3523
Published by jackalcooper about 4 years ago