OneFlow v1.0.0 has been released. Welcome to install the new version for a better experience.
This release includes 447 commits and the following highlights:
Released a new interface `compile_from_torch`. This interface converts a PyTorch Module instance into a OneFlow Module instance while sharing the parameter memory. It supports direct Eager execution or conversion into a static graph nn.Graph, further accelerated by MLIR compilation. The interface is evolving rapidly and currently supports dynamic-shape compilation; it has been validated on typical models such as ResNet50, Faster RCNN, and Stable Diffusion.
Made a series of optimizations and refactorings to the Eager execution runtime, including unification of system memory pools, integration with CUDA native interfaces, optimization of the instruction scheduling mechanism, introduction of an instruction fusion mechanism, faster Autograd graph construction, optimization of the op inference process, and decoupling of Instruction and Stream.
The static graph distributed physical execution plan supports separate compilation functionality, allowing each process to independently compile its required execution plan, eliminating linear growth of compilation time with GPU scale.
Added a series of functional automatic differentiation interfaces, including jvp, vjp, hvp, vhp, jacobian, and hessian.
Added the Insight module, supporting visualization of kernel invocation, execution time, speed, and other information within instrumented (tracepoint) intervals.
Updated LiBai (the open-source toolbox for large-scale model training) with native support for fine-tuning and distributed inference of Llama2 and ChatGLM2, covering full fine-tuning, adapter fine-tuning, and LoRA fine-tuning; lm-eval-harness can be used for language model evaluation and validation.
Upgraded OneFlow Serving to support the OneFlow Python backend and the OneFlow Lite backend, in addition to the existing OneFlow Cpp backend.
The `compile_from_torch` interface converts a PyTorch Module instance into a OneFlow Module instance while sharing the parameter memory. It supports direct Eager execution or conversion into a static graph nn.Graph, further accelerated by MLIR compilation. (https://github.com/Oneflow-Inc/oneflow/pull/10404, https://github.com/Oneflow-Inc/oneflow/pull/10408, https://github.com/Oneflow-Inc/oneflow/pull/9984, https://github.com/Oneflow-Inc/oneflow/pull/9754)
Interface Signature and Parameter Introduction:
compile_from_torch(torch_module: torch.nn.Module, *, use_graph=True, options={})
* torch_module: The Torch Module instance to be converted.
* use_graph: Indicates whether to transform into a static graph nn.Graph and utilize MLIR compilation acceleration. The default is True.
* options:
* size: When using the static graph nn.Graph, a hash of the graph corresponding to the input shape is computed and cached. size indicates the maximum capacity of the static graph cache; when the maximum capacity is exceeded, graphs are evicted according to an LRU strategy. The default value is 9.
* dynamic: The first input with a dynamic shape triggers a full graph compilation. For subsequent inputs with different shapes, if dynamic is True, a shared graph is reused to accelerate compilation; if dynamic is False, each new shape triggers a full compilation. The default is True.
* debug: Debug mode and log level settings. -1 disables debug mode, 0 outputs warnings and static graph construction information, 1 additionally outputs graph construction information for each sub-module, 2 additionally outputs progress for each operator, 3 provides more detailed operator information. The default value is -1.
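The size option's LRU eviction can be pictured with a minimal plain-Python sketch; `GraphCache` and `compile_fn` below are illustrative names, not part of the OneFlow API:

```python
from collections import OrderedDict

class GraphCache:
    """Illustrative LRU cache keyed by input shape, mirroring the
    `size` option described above (hypothetical, not OneFlow API)."""
    def __init__(self, size=9):
        self.size = size
        self._cache = OrderedDict()

    def get_or_compile(self, shape, compile_fn):
        key = hash(shape)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        graph = compile_fn(shape)         # full compilation on a cache miss
        self._cache[key] = graph
        if len(self._cache) > self.size:
            self._cache.popitem(last=False)  # evict least recently used
        return graph

cache = GraphCache(size=2)
cache.get_or_compile((1, 3, 224, 224), lambda s: f"graph{s}")
cache.get_or_compile((4, 3, 224, 224), lambda s: f"graph{s}")
cache.get_or_compile((8, 3, 224, 224), lambda s: f"graph{s}")  # evicts the first entry
print(len(cache._cache))  # → 2
```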
Example of Usage:
import torch
from torchvision import models
import oneflow
from oneflow.framework.infer_compiler import compile_from_torch
DEVICE = torch.device("cuda")
WEIGHT = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=WEIGHT).to(DEVICE)
compile_model = compile_from_torch(model, options={"dynamic": True})
The static graph distributed physical execution plan supports separate compilation, allowing each process to independently compile its required execution plan, thereby preventing linear growth of compilation time with GPU scale. The separate compilation feature supports 3D hybrid parallelism (data parallelism + model parallelism + pipeline parallelism) and can be used together with LiBai (the open-source large-scale model training toolbox). To enable this feature, use the command: export ONEFLOW_ENABLE_LAZY_SEPARATE_COMPILE=1. (https://github.com/Oneflow-Inc/oneflow/pull/9920, https://github.com/Oneflow-Inc/oneflow/pull/10140, https://github.com/Oneflow-Inc/oneflow/pull/10141, https://github.com/Oneflow-Inc/oneflow/pull/10124, https://github.com/Oneflow-Inc/oneflow/pull/10102)
Below are the test results for the GPT2 model with LiBai on 128 A100-PCIE-40GB GPUs:
Parallelism | Separate Compilation Enabled | Execution Plan Compilation Time
---|---|---
Data Parallelism (DP128 MP1 PP1) | No | Over 20 minutes
Data Parallelism (DP128 MP1 PP1) | Yes | 108.21 s
3D Parallelism (DP4 MP4 PP8) | No | 445.16 s
3D Parallelism (DP4 MP4 PP8) | Yes | 82.88 s
A series of functional automatic differentiation-related interfaces have been introduced, including jvp, vjp, hvp, vhp, jacobian, and hessian. (https://github.com/Oneflow-Inc/oneflow/pull/10412, https://github.com/Oneflow-Inc/oneflow/pull/10428)
Example of Usage:
import oneflow as flow

# jacobian example
def exp_reducer(x):
    return x.exp().sum(dim=1)

input = flow.rand(2, 2)
jac_rslt = flow.autograd.functional.jacobian(exp_reducer, input)

# vhp example
def pow_reducer(x):
    return x.pow(3).sum()

input = flow.rand(2, 2)
v = flow.ones(2, 2)
vhp_rslt = flow.autograd.functional.vhp(pow_reducer, input, v)
Introduced a new Insight module, enabling visualization of kernel invocation, execution time, speed, and other information within instrumented (tracepoint) intervals. (https://github.com/Oneflow-Inc/oneflow/pull/10370)
Usage:
For more detailed information, please refer to: https://github.com/Oneflow-Inc/oneflow/tree/master/python/oneflow/utils/insight#usage
LiBai (the open-source toolbox for large-scale model training) has been upgraded to version v0.3.0. It now natively supports fine-tuning and distributed inference of the large language models Llama2 and ChatGLM2, covering full fine-tuning, adapter fine-tuning, and LoRA fine-tuning; lm-eval-harness can be used for language model evaluation and validation.
The distributed training and inference support for ChatGLM and Llama2 is as follows:
Example of Usage:
# full finetune
bash tools/train.sh projects/Llama/train_net.py projects/Llama/configs/llama_sft.py 8
# adapter finetune
bash tools/train.sh projects/Llama/adapter/train_net.py projects/Llama/adapter/adapter_sft.py 8
# inference
bash tools/infer.sh projects/Llama/pipeline.py 8
# eval
python projects/Llama/utils/eval_adapter.py
Added FFT-related operators. (https://github.com/Oneflow-Inc/oneflow/pull/10027)
Added the `zeta` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10189)
Added the `tril_` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9996)
Added the `clone` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9800)
Added the `frac` and `frac_` operators. (https://github.com/Oneflow-Inc/oneflow/pull/9979)
Added the `exp2` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9958)
Added the `rrelu` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9736)
Added the `lgamma` backward operator. (https://github.com/Oneflow-Inc/oneflow/pull/10177)
Added the `digamma` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10066)
Added the `trigamma` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10117)
Added the `bitwise_not` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9859)
Added the `squared_relu` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10316)
Added the `skip_rms_norm` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10036)
Added `multi_tensor_amp_grad_scaler`-related operators. (https://github.com/Oneflow-Inc/oneflow/pull/10071)
Added the `bitwise_and`, `bitwise_or`, and `bitwise_xor` operators. (https://github.com/Oneflow-Inc/oneflow/pull/9842)
Added the `fused_attention_concat_past_key_value` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9963)
Added the `fused_multi_head_attention_inference_v2` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9933)
Added the `fused_codegeex_qkv_reshape` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9927)
Added the `fused_apply_rotary_emb` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9914)
Added the `skip_layer_norm` operator. (https://github.com/Oneflow-Inc/oneflow/pull/9906)
Added the `groupwise_dequantize` and `fused_linear_with_groupwise_quantized_weight` operators. (https://github.com/Oneflow-Inc/oneflow/pull/9900)
Added the `fused_scale_mask_bias_softmax` and `fused_scale_mask_bias_softmax_grad` operators. (https://github.com/Oneflow-Inc/oneflow/pull/9867)
Added the `depend` operator for describing dependency relationships in the computation graph. (https://github.com/Oneflow-Inc/oneflow/pull/9807)
Added operators for handling complex data types: `real`, `imag`, `conj`, and `conj_physical`. (https://github.com/Oneflow-Inc/oneflow/pull/10034, https://github.com/Oneflow-Inc/oneflow/pull/10281)
Added CPU support for the `nms` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10225)
Added support for the `cast` operator to convert the `bool` data type to `int16`. (https://github.com/Oneflow-Inc/oneflow/pull/10211)
Added support for the `arange` operator for the `fp16` data type. (https://github.com/Oneflow-Inc/oneflow/pull/10019)
Added support for the `adaptive_avg_pool` operator for the `fp16` data type. (https://github.com/Oneflow-Inc/oneflow/pull/10004)
Added support for the `nonzero` operator for the `fp16` data type. (https://github.com/Oneflow-Inc/oneflow/pull/9826)
Added support for the `exponential` operator for the `half` data type. (https://github.com/Oneflow-Inc/oneflow/pull/10005)
Added support for the `arg_sort` and `top_k` operators for the `half` data type. (https://github.com/Oneflow-Inc/oneflow/pull/10000)
Added support for basic operators such as `add`, `sub`, `mul`, `mm`, `sqrt`, and `div` for complex data types. (https://github.com/Oneflow-Inc/oneflow/pull/10269, https://github.com/Oneflow-Inc/oneflow/pull/10136, https://github.com/Oneflow-Inc/oneflow/pull/10284, https://github.com/Oneflow-Inc/oneflow/pull/10049)
Added support for basic binary operators with non-contiguous input tensors. (https://github.com/Oneflow-Inc/oneflow/pull/9986)
Added a virtual `jit` interface to support mocking torch for user code that imports the interface but does not actually use it. (https://github.com/Oneflow-Inc/oneflow/pull/10395)
Added the `mem_get_info` interface to return total and free memory for a specified CUDA device. (https://github.com/Oneflow-Inc/oneflow/pull/10398)
Added the `tensor.new` interface. (https://github.com/Oneflow-Inc/oneflow/pull/9881)
Added the `tensor.is_cpu` interface. (https://github.com/Oneflow-Inc/oneflow/pull/10172)
Added the `tensor.is_view` interface. (https://github.com/Oneflow-Inc/oneflow/pull/10101)
Added the `tensor.data_ptr` interface. (https://github.com/Oneflow-Inc/oneflow/pull/10111, https://github.com/Oneflow-Inc/oneflow/pull/10139)
Added the `tensor.baddbmm` interface. (https://github.com/Oneflow-Inc/oneflow/pull/9918)
Added interfaces such as `special.erf` and `special.erfc`. (https://github.com/Oneflow-Inc/oneflow/pull/9982)
Added the `layout` and `frombuffer` interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/10171)
Added prune-related interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/9730)
Added the `utils.model_zoo` interface. (https://github.com/Oneflow-Inc/oneflow/pull/10183)
Added the `get_rng_state` and `get_rng_state_all` interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/9760)
Added the `set_rng_state` and `set_rng_state_all` interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/10250)
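These interfaces follow the usual capture-and-restore RNG-state pattern. A minimal sketch using Python's stdlib `random` module as a stand-in for OneFlow's generators:

```python
import random

# Capture the generator state, draw a sample, then restore the state and
# draw again: the restored state reproduces the same sample. The added
# get_rng_state/set_rng_state interfaces follow this same shape for
# OneFlow's generators (stdlib random is used here only as an analogy).
state = random.getstate()      # analogous to get_rng_state()
first = random.random()
random.setstate(state)         # analogous to set_rng_state(state)
second = random.random()
print(first == second)  # → True
```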
Added support for the `float16` data type. (https://github.com/Oneflow-Inc/oneflow/pull/9697)
Added support for the `char` and `short` data types. (https://github.com/Oneflow-Inc/oneflow/pull/10086)
Added support for the `complex64` and `complex128` data types. (https://github.com/Oneflow-Inc/oneflow/pull/9987)
Integrated Transform Dialect into the MLIR codegen process. (https://github.com/Oneflow-Inc/oneflow/pull/10224, https://github.com/Oneflow-Inc/oneflow/pull/10227)
Added code generation support for the `matmul` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10283)
Added code generation support for the `softmax` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10263, https://github.com/Oneflow-Inc/oneflow/pull/10272)
Added code generation support for the `transform.oneflow.apply_patterns` operator. (https://github.com/Oneflow-Inc/oneflow/pull/10255)
Added support for `byte` attributes in the MLIR codegen process. (https://github.com/Oneflow-Inc/oneflow/pull/10276)
Added `extra_libs` functionality to the `mock_torch` module, enabling flowvision to mimic torchvision's functionality. (https://github.com/Oneflow-Inc/oneflow/pull/10223)
Added a `lazy` parameter to the `mock_torch` module, allowing non-existent interfaces to return a fake object instead of raising an immediate error. (https://github.com/Oneflow-Inc/oneflow/pull/9876)
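The lazy behavior can be pictured with a minimal placeholder-object sketch; the `DummyObject` class below is illustrative, not OneFlow's actual implementation:

```python
class DummyObject:
    """Lazily-failing placeholder: attribute access returns another
    placeholder, and an error is raised only when the object is called."""
    def __init__(self, path="torch"):
        self._path = path
    def __getattr__(self, name):
        # No error here: just record the access path and keep going.
        return DummyObject(f"{self._path}.{name}")
    def __call__(self, *args, **kwargs):
        raise NotImplementedError(f"{self._path} is not implemented in the mock")

torch = DummyObject()
attr = torch.some.missing.api   # importing/touching the name does not fail
print(attr._path)               # → torch.some.missing.api
```

Only actually calling the placeholder raises, so code that merely imports an unsupported interface keeps running.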
Added `skip_init` functionality and introduced the meta device. (https://github.com/Oneflow-Inc/oneflow/pull/10008)
Introduced the HostMemoryInput mechanism, allowing an operator's specific input to be defined as HostMemoryInput type for accessing data within the kernel's host function body. (https://github.com/Oneflow-Inc/oneflow/pull/9928)
Added fusion mechanism for nccl logical operations to reduce excessive synchronization overhead in scenarios like ZERO, where too many fragmented nccl calls lead to significant training speed reduction. (https://github.com/Oneflow-Inc/oneflow/pull/9879)
Introduced a mechanism for re-computation of tensor operations. (https://github.com/Oneflow-Inc/oneflow/pull/9861)
Added support for `backward_hook`, `register_full_backward_hook`, and `register_state_dict_pre_hook`. (https://github.com/Oneflow-Inc/oneflow/pull/9837, https://github.com/Oneflow-Inc/oneflow/pull/9710)
Added content related to the stochastic weight averaging algorithm to the optimizers module. (https://github.com/Oneflow-Inc/oneflow/pull/9781)
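As background, stochastic weight averaging keeps a running mean of the weights collected at periodic checkpoints; a plain-Python sketch of the averaging step (not the OneFlow optimizer API):

```python
def swa_average(checkpoints):
    """Return the element-wise running mean of a sequence of weight lists,
    as SWA does across collected checkpoints (illustrative sketch)."""
    n = 0
    avg = None
    for weights in checkpoints:
        n += 1
        if avg is None:
            avg = list(weights)
        else:
            # incremental running mean: avg += (w - avg) / n
            avg = [a + (w - a) / n for a, w in zip(avg, weights)]
    return avg

print(swa_average([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]))  # → [2.0, 4.0]
```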
Added graph-level flattening algorithm. (https://github.com/Oneflow-Inc/oneflow/pull/9718, https://github.com/Oneflow-Inc/oneflow/pull/9748)
Added DelayVariableOpExecutionPass optimization pass for the computation graph. (https://github.com/Oneflow-Inc/oneflow/pull/9745)
Added the `MulCastPattern` operator fusion rule. (https://github.com/Oneflow-Inc/oneflow/pull/9715)
Added the environment variable `ONEFLOW_ENABLE_GLOBAL_INPUTS_WITH_INCONSISTENT_PLACEMENT` to control whether global tensors used by operators are automatically placed on the largest rank via the `to_global` operation. (https://github.com/Oneflow-Inc/oneflow/pull/10073)
Added the environment variable `ONEFLOW_EAGER_NCCL_USE_COMPUTE_STREAM` to control whether `nccl` and regular computations in eager mode run on the same stream. The default value is `false`. (https://github.com/Oneflow-Inc/oneflow/pull/10230)
Added the environment variable `VLOG_REMAT` to handle dynamic graph recomputation logs and interface with `ComputeComplexityFn` to estimate op computation time. (https://github.com/Oneflow-Inc/oneflow/pull/10212)
Added the environment variable `ENABLE_ACTOR_DEBUG_LOG` to print detailed logs of actor message sending, receiving, and execution on the current rank. (https://github.com/Oneflow-Inc/oneflow/pull/10081)
Added the environment variable `ONEFLOW_RUN_GRAPH_BY_VM` to control whether the VM is used to run the static graph nn.Graph. (https://github.com/Oneflow-Inc/oneflow/pull/9884)
Added the environment variable `ONEFLOW_DISABLE_MOCK_TORCH` to control whether to disable the `mock_torch` functionality. (https://github.com/Oneflow-Inc/oneflow/pull/9805)
Added the environment variable `ONEFLOW_VM_MULTI_THREAD` to control the number of threads used in the VM. (https://github.com/Oneflow-Inc/oneflow/pull/9698)
Added support for the second-order optimizer `lbfgs`. (https://github.com/Oneflow-Inc/oneflow/pull/10265)
A series of optimizations and refactoring has been implemented for the Eager runtime, including:
Unified system memory pool to manage memory resources across all allocators on the same device. (https://github.com/Oneflow-Inc/oneflow/pull/8591)
Integration with CUDA native interfaces to accelerate kernel launches. (https://github.com/Oneflow-Inc/oneflow/pull/8571)
Optimization of the instruction scheduling mechanism to reduce system overhead. (https://github.com/Oneflow-Inc/oneflow/pull/8796)
Introduction of an instruction fusion mechanism to accelerate instruction dispatch. (https://github.com/Oneflow-Inc/oneflow/pull/7399)
Speed improvement in Autograd graph construction. (https://github.com/Oneflow-Inc/oneflow/pull/8606)
Optimization of op deduction process to accelerate kernel execution. (https://github.com/Oneflow-Inc/oneflow/pull/8672, https://github.com/Oneflow-Inc/oneflow/pull/8619, https://github.com/Oneflow-Inc/oneflow/pull/8662)
Consolidation of redundant concepts within the eager runtime, decoupling Instruction and Stream. (https://github.com/Oneflow-Inc/oneflow/pull/8583, https://github.com/Oneflow-Inc/oneflow/pull/8590, https://github.com/Oneflow-Inc/oneflow/pull/7607)
Users can configure the Eager runtime using various environment variables:
Environment Variable | Meaning | Default Value
---|---|---
ONEFLOW_VM_COMPUTE_ON_WORKER_THREAD | Whether to perform computation on worker threads | true
ONEFLOW_VM_MULTI_THREAD | Whether to use multi-threaded collaboration for Eager computation | true
ONEFLOW_VM_ENABLE_STREAM_WAIT | Whether to use the stream_wait mechanism for dependencies between multiple streams | true
ONEFLOW_VM_ENABLE_SCHEDULE_YIELD | Whether to use the yield mechanism to reduce the scheduler thread's busy waiting | true
ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE | Whether to cache operator output metadata during computation | true
ONEFLOW_VM_WORKER_THREAD_LIMIT | Number of worker threads | 16
ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE | Maximum number of VM instructions that may be fused | 10
ONEFLOW_VM_BLOCKING_DEBUG_INSTRUCTIONS_DISPLAY_LIMIT | Number of unprocessed instructions printed when VM execution times out | 1000
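For example, these variables can be set in the shell before launching a script; the values shown are the defaults from the table and should be adjusted per workload:

```shell
# Defaults from the table above, set explicitly for illustration.
export ONEFLOW_VM_COMPUTE_ON_WORKER_THREAD=true
export ONEFLOW_VM_WORKER_THREAD_LIMIT=16
export ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE=10
export ONEFLOW_VM_BLOCKING_DEBUG_INSTRUCTIONS_DISPLAY_LIMIT=1000
```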
OneFlow Serving features have been upgraded to support additional backends, including OneFlow Python backend and OneFlow Lite backend, in addition to the existing support for the OneFlow Cpp backend.
For usage instructions, refer to: https://github.com/Oneflow-Inc/serving/blob/main/README.md
Optimized certain code implementations to accommodate CUDA 12.x. (https://github.com/Oneflow-Inc/oneflow/pull/10367)
Optimized the glu operator implementation to support bias-less inputs. (https://github.com/Oneflow-Inc/oneflow/pull/9874)
Optimized pooling operator implementation to support the channels_last parameter. (https://github.com/Oneflow-Inc/oneflow/pull/10242)
Optimized the flip operator implementation to address memory access inefficiencies when dim = -1. (https://github.com/Oneflow-Inc/oneflow/pull/10310)
Optimized the bincount operator implementation for accelerated performance. (https://github.com/Oneflow-Inc/oneflow/pull/10308)
Optimized the index_add operator implementation by dispatching varied logic based on index length to enhance performance for smaller indices. (https://github.com/Oneflow-Inc/oneflow/pull/9751)
Optimized the topk operator implementation to boost performance when batch size equals 1. (https://github.com/Oneflow-Inc/oneflow/pull/10009)
Optimized implementations of operators such as conv and arange to facilitate CUDA graph usage. (https://github.com/Oneflow-Inc/oneflow/pull/9761)
Optimized the upsample operator implementation to include input/output size validation. (https://github.com/Oneflow-Inc/oneflow/pull/9737)
Optimized the grouped_matmul_bias operator implementation by introducing tensor parallelism sbp derivation rules. (https://github.com/Oneflow-Inc/oneflow/pull/9934)
Optimized the reshape operator implementation with added nd sbp derivation rules. (https://github.com/Oneflow-Inc/oneflow/pull/9858)
Optimized error messages and completed test cases for mask_fill and in_top_k operators. (https://github.com/Oneflow-Inc/oneflow/pull/10062)
Optimized the higher-order differentiation rules for the tanh operator to optimize performance under third-order differentiation. (https://github.com/Oneflow-Inc/oneflow/pull/10188, https://github.com/Oneflow-Inc/oneflow/pull/10237)
Optimized conv interface implementation to support device and dtype parameters. (https://github.com/Oneflow-Inc/oneflow/pull/10228)
Optimized conv interface implementation to automatically expand input dimensions. (https://github.com/Oneflow-Inc/oneflow/pull/9721)
Optimized sum interface implementation to accommodate dtype parameters. (https://github.com/Oneflow-Inc/oneflow/pull/10204)
Optimized softmax interface implementation to support dtype parameters. (https://github.com/Oneflow-Inc/oneflow/pull/10069)
Optimized maxpool interface implementation to support 3D input tensors. (https://github.com/Oneflow-Inc/oneflow/pull/10110)
Optimized ctc_loss interface implementation to align its parameters with the PyTorch interface. (https://github.com/Oneflow-Inc/oneflow/pull/9887)
Optimized copy interface implementation to support scenarios where input and output have different devices and dtypes. (https://github.com/Oneflow-Inc/oneflow/pull/9888)
Optimized grad interface implementation to support the allow_unused parameter. (https://github.com/Oneflow-Inc/oneflow/pull/10251)
Optimized load interface implementation to provide more user-friendly error messages. (https://github.com/Oneflow-Inc/oneflow/pull/10138)
Optimized fused_matmul_bias operator and interface implementation to support alpha and beta parameters. (https://github.com/Oneflow-Inc/oneflow/pull/10015)
Optimized normal operator and interface implementation to align behavior with PyTorch. (https://github.com/Oneflow-Inc/oneflow/pull/10185)
Optimized fused attention operator and interface implementation to allow None for past_key and past_value. (https://github.com/Oneflow-Inc/oneflow/pull/9977)
Optimized fused_attention operator and interface implementation to add support for variable sequence lengths. (https://github.com/Oneflow-Inc/oneflow/pull/9991)
Optimized fused_multi_head_attention_inference operator and interface implementation to include the attn_bias parameter. (https://github.com/Oneflow-Inc/oneflow/pull/9853)
Optimized bn-related functor implementation, merging bn_add_relu and bn_relu operations to expedite inference. (https://github.com/Oneflow-Inc/oneflow/pull/10239)
Optimized MLIR CodeGen-based processes and upgraded LLVM version to 16.0.0. (https://github.com/Oneflow-Inc/oneflow/pull/9985)
Optimized MLIR codegen-based processes by adding AppendOneFlowStream, MgpuToOneFlowStream, and CastOneFlowInputToSignlessPass passes. (https://github.com/Oneflow-Inc/oneflow/pull/10149, https://github.com/Oneflow-Inc/oneflow/pull/10151, https://github.com/Oneflow-Inc/oneflow/pull/10099)
Optimized MLIR codegen-based processes by linking LibDevice to support NVVM IR conversion to cubin. (https://github.com/Oneflow-Inc/oneflow/pull/10200)
Optimized MLIR codegen-based processes by utilizing tmpbuffer as MemPool in MLIR. (https://github.com/Oneflow-Inc/oneflow/pull/10159)
Optimized MLIR codegen-based processes by enabling bufferizable operator dispatch. (https://github.com/Oneflow-Inc/oneflow/pull/9787)
Optimized MLIR codegen-based processes to expedite ofmempool and related processes. (https://github.com/Oneflow-Inc/oneflow/pull/10152, https://github.com/Oneflow-Inc/oneflow/pull/10168, https://github.com/Oneflow-Inc/oneflow/pull/10184, https://github.com/Oneflow-Inc/oneflow/pull/10239)
Optimized stacktrace call stack information. (https://github.com/Oneflow-Inc/oneflow/pull/9912, https://github.com/Oneflow-Inc/oneflow/pull/9937, https://github.com/Oneflow-Inc/oneflow/pull/10260, https://github.com/Oneflow-Inc/oneflow/pull/10161)
Optimized random number generator implementation by adding caching to avoid regeneration with each call. (https://github.com/Oneflow-Inc/oneflow/pull/10387)
Optimized graph load functionality to support loading the graph onto a new device. (https://github.com/Oneflow-Inc/oneflow/pull/10335)
Optimized dummy array initialization implementation using fold expressions. (https://github.com/Oneflow-Inc/oneflow/pull/10271)
Optimized MemoryFormat class organization, exposed to the Python layer via CPython to support changing a tensor's MemoryFormat using the Tensor.to interface. (https://github.com/Oneflow-Inc/oneflow/pull/10181)
Optimized implementations of stream, device, and vm to support more device types. (https://github.com/Oneflow-Inc/oneflow/pull/10166)
Optimized error messages for MapAt, adding printing of key values. (https://github.com/Oneflow-Inc/oneflow/pull/10090)
Optimized OOM error messages to differentiate CUDA and CPU devices and display size. (https://github.com/Oneflow-Inc/oneflow/pull/9938)
Optimized error messages for CHECK_XX_OR_RETURN macros. (https://github.com/Oneflow-Inc/oneflow/pull/9921)
Optimized error messages for graph-related issues. (https://github.com/Oneflow-Inc/oneflow/pull/9821)
Optimized error messages for convolution operator-related issues. (https://github.com/Oneflow-Inc/oneflow/pull/9707)
Optimized model initialization to minimize additional overhead. (https://github.com/Oneflow-Inc/oneflow/pull/10088)
Optimized thread manager implementation to accommodate three usage scenarios: unrestricted threads, master as a thread, and n threads. (https://github.com/Oneflow-Inc/oneflow/pull/10060)
Optimized the numpy array release mechanism to release arrays in the main thread, reducing time-consuming GIL acquisitions. (https://github.com/Oneflow-Inc/oneflow/pull/10050)
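The idea can be pictured with a minimal hand-off sketch (names are illustrative, not OneFlow internals): worker code enqueues objects instead of dropping the last reference itself, and the main thread drains the queue so the destructors run there:

```python
import queue

# Hand-off queue: workers push objects they are done with; the main
# thread drains it, so the last reference is dropped (and any costly
# destructor work happens) on the main thread.
release_queue = queue.Queue()

def worker_done_with(obj):
    release_queue.put(obj)  # defer the release instead of doing it here

def drain_release_queue():
    released = 0
    while not release_queue.empty():
        release_queue.get()  # last reference dropped on this thread
        released += 1
    return released

worker_done_with([1, 2, 3])
worker_done_with([4, 5])
print(drain_release_queue())  # → 2
```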
Optimized graph save runtime_state_dict implementation to enhance performance and address related issues. (https://github.com/Oneflow-Inc/oneflow/pull/10016)
Optimized parsing of different calling methods for interfaces like Tensor.foo(*args) using a unified PyParseArgs function. (https://github.com/Oneflow-Inc/oneflow/pull/9983)
Optimized the implementation of the ArgsTree class to support arbitrary output types and conducted file location migration. (https://github.com/Oneflow-Inc/oneflow/pull/9846)
Optimized memory allocation mechanism to achieve ordered allocation based on streams. (https://github.com/Oneflow-Inc/oneflow/pull/9818)
Removed deallocate context. (https://github.com/Oneflow-Inc/oneflow/pull/10143)
Removed debug compilation mode in graph compilation. (https://github.com/Oneflow-Inc/oneflow/pull/10145)
Removed unused logic for MemChain merge. (https://github.com/Oneflow-Inc/oneflow/pull/10097)
Removed default settings for some unused distributed environment variables. (https://github.com/Oneflow-Inc/oneflow/pull/9803)
Refactored collective boxing implementation under lazy mode. (https://github.com/Oneflow-Inc/oneflow/pull/10098)
Refactored registration of EagerCclS2S. (https://github.com/Oneflow-Inc/oneflow/pull/10100)
Refactored implementation of collective_boxing_executor_backend. (https://github.com/Oneflow-Inc/oneflow/pull/10082)
Refactored implementation of running global nn.Graph using VM. (https://github.com/Oneflow-Inc/oneflow/pull/10048)
Refactored implementation of local-to-global related interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/9870)
Refactored operator dispatch dialect implementation in MLIR codegen process. (https://github.com/Oneflow-Inc/oneflow/pull/9693)
Refactored implementation of random generator and distribution kernels. (https://github.com/Oneflow-Inc/oneflow/pull/9691)
Refactored implementation of fast_atomic_add operator. (https://github.com/Oneflow-Inc/oneflow/pull/9680)
Refactored error check related macros in glog. (https://github.com/Oneflow-Inc/oneflow/pull/10176)
Refactored implementation of random generator. (https://github.com/Oneflow-Inc/oneflow/pull/10025)
Refactored implementation of some elementwise primitive operations. (https://github.com/Oneflow-Inc/oneflow/pull/9857)
Refactored code related to device descriptions. (https://github.com/Oneflow-Inc/oneflow/pull/9791)
Refactored implementation of ParseDeviceString and ParseDeviceNameConf. (https://github.com/Oneflow-Inc/oneflow/pull/9833)
Refactored implementation of ActorMsg related functionalities, introducing IBVerbsActorMsgWrapper wrapper to reduce the size of ActorMsg. (https://github.com/Oneflow-Inc/oneflow/pull/9762)
Refactored implementation of save and load interfaces, migrating the method of saving graphs to the _save_graph function, adding some _open* helper classes to differentiate between paths and memory, enabling saving weights to BytesIO in save, and supporting file streaming in load. (https://github.com/Oneflow-Inc/oneflow/pull/10021)
Refactored implementation of some tensor-related interfaces, migrating code from Python layer to C++ layer. (https://github.com/Oneflow-Inc/oneflow/pull/9990, https://github.com/Oneflow-Inc/oneflow/pull/9964)
Upgraded PyBind version used in the project to 2.11.1. (https://github.com/Oneflow-Inc/oneflow/pull/10391)
Fixed default dynamic linking settings in CMake files to avoid LLVM15 linking errors. (https://github.com/Oneflow-Inc/oneflow/pull/10373, https://github.com/Oneflow-Inc/oneflow/pull/10131)
Fixed cast-related bugs in MLIR codegen. (https://github.com/Oneflow-Inc/oneflow/pull/10105)
Fixed logic handling for cpg attr in Module._apply function. (https://github.com/Oneflow-Inc/oneflow/pull/10343)
Fixed inheritance issue for DummyModule when attr is mro_entries. (https://github.com/Oneflow-Inc/oneflow/pull/9976)
Fixed size checking issue for _handle_size_arg in full op. (https://github.com/Oneflow-Inc/oneflow/pull/9975)
Fixed residual environment variables after launching mock via command line, causing subsequent API mock parameter errors. (https://github.com/Oneflow-Inc/oneflow/pull/9970)
Fixed inability to exit when two processes encounter exceptions. (https://github.com/Oneflow-Inc/oneflow/pull/10054)
Fixed bug in grouped quantization sbp derivation. (https://github.com/Oneflow-Inc/oneflow/pull/10132)
Fixed kMaxInputCount check issue in GroupedMatmulFunctor. (https://github.com/Oneflow-Inc/oneflow/pull/10322)
Fixed 0-size tensor broadcast issue. (https://github.com/Oneflow-Inc/oneflow/pull/10186)
Fixed issue where double type attr was not updated when using shared_graph. (https://github.com/Oneflow-Inc/oneflow/pull/10279)
Fixed data type error in GetItemInScalarTensor. (https://github.com/Oneflow-Inc/oneflow/pull/10226)
Fixed gradient issue in GroupNorm, calling GroupNormParamGrad only when gamma and beta gradients are required. (https://github.com/Oneflow-Inc/oneflow/pull/10045)
Fixed error when reading tensors with partial ranks in global mode. (https://github.com/Oneflow-Inc/oneflow/pull/10056)
Fixed control boundary issues in checkpointing under PP, affecting task graph construction under separate compilation. (https://github.com/Oneflow-Inc/oneflow/pull/10057)
Fixed bug when using 3D parallelism and enabling activation checkpointing simultaneously. (https://github.com/Oneflow-Inc/oneflow/pull/10031)
Fixed adaptation bug of AutoMixedPrecision pass on non-CUDA devices and bug related to device combinations in LayerNorm Module. (https://github.com/Oneflow-Inc/oneflow/pull/10026)
Fixed default value setting issue for reduce parameter in scatter operator. (https://github.com/Oneflow-Inc/oneflow/pull/10002)
Fixed incomplete disable of some Torch variables in mock.disable, causing lingering references in other globals. (https://github.com/Oneflow-Inc/oneflow/pull/9989)
Fixed destructor issue in vm::TensorStorage. (https://github.com/Oneflow-Inc/oneflow/pull/9962)
Fixed offload issue where small tensors were not released from CUDA memory. (https://github.com/Oneflow-Inc/oneflow/pull/9974)
Fixed occasional segmentation fault in the Python stack getter due to thread unsafety. (https://github.com/Oneflow-Inc/oneflow/pull/9955)
Fixed element lookup issue in set under separate compilation scenario. (https://github.com/Oneflow-Inc/oneflow/pull/9952)
Aligned qkv and output_layout in fused_multi_head_attention operator. (https://github.com/Oneflow-Inc/oneflow/pull/9950)
Fixed inconsistency in seed behavior of random series operators between graph and checkpointing. (https://github.com/Oneflow-Inc/oneflow/pull/9941)
Fixed parameter reload failure issue in Eager mode. (https://github.com/Oneflow-Inc/oneflow/pull/9935)
Fixed infinite loop issue in specific cases of mock torch lazy functionality. (https://github.com/Oneflow-Inc/oneflow/pull/9926)
Fixed issue where code in the stft_kernel.cu file was not compiled by default. (https://github.com/Oneflow-Inc/oneflow/pull/9922)
Fixed deadlock and memory allocation errors caused by an invalid topological order in order_in_graph due to an incomplete TaskGraph under separate compilation. (https://github.com/Oneflow-Inc/oneflow/pull/9909)
Fixed xrt compilation issue where fmt could not be found. (https://github.com/Oneflow-Inc/oneflow/pull/9894)
Fixed imbalance in GPU memory allocation among processes during local to global process where sbp is B. (https://github.com/Oneflow-Inc/oneflow/pull/9852)
Aligned OneFlow and PyTorch behaviors related to the third parameter of CTCLoss. (https://github.com/Oneflow-Inc/oneflow/pull/9845)
Fixed initialization issues related to thread_global_id and rank_group_scope. (https://github.com/Oneflow-Inc/oneflow/pull/9841)
Fixed inplace handling errors in dropout operator implementation. (https://github.com/Oneflow-Inc/oneflow/pull/9808)
Fixed errors in loading non-tensor objects saved by PyTorch in the load function. (https://github.com/Oneflow-Inc/oneflow/pull/9804)
Fixed conflicts between contiguous memory and GPU memory allocation strategies. (https://github.com/Oneflow-Inc/oneflow/pull/9786)
Fixed memory allocation issues in EagerBlobObject::ByteSizeOfBlobBody when considering non-contiguous cases. (https://github.com/Oneflow-Inc/oneflow/pull/9782)
Fixed dtype inference errors in fill_ operator during autocast. (https://github.com/Oneflow-Inc/oneflow/pull/9776)
Fixed sbp derivation rule issues in fused_glu operator. (https://github.com/Oneflow-Inc/oneflow/pull/10108)
Fixed issues related to calling nn.Graph.__map_io. (https://github.com/Oneflow-Inc/oneflow/pull/10084)
Fixed inconsistency between set_grad_mode interface and PyTorch behavior. (https://github.com/Oneflow-Inc/oneflow/pull/10059)
Fixed an issue related to the map_location parameter in the load interface and added support for passing lambda functions. (https://github.com/Oneflow-Inc/oneflow/pull/10052)
Fixed stride inference errors after unsqueeze operation in view mode. (https://github.com/Oneflow-Inc/oneflow/pull/9775)
Fixed problems in conv op with unbatched input and bias, and added support for unbatched input in deconv op. (https://github.com/Oneflow-Inc/oneflow/pull/9740)
Fixed logic errors in trunc_normal_ implementation. (https://github.com/Oneflow-Inc/oneflow/pull/9711)
Fixed default value issue in dim parameter of topk operator. (https://github.com/Oneflow-Inc/oneflow/pull/9703)
Fixed issues where placement of some networks was incorrectly set to CPU during static graph printing. (https://github.com/Oneflow-Inc/oneflow/pull/9770)
Fixed conflict between include paths of trt_flash_attention and native flash attention. (https://github.com/Oneflow-Inc/oneflow/pull/9750)
Fixed segmentation fault caused by is_shutting_down and gil in stack getter. (https://github.com/Oneflow-Inc/oneflow/pull/9681)
Fixed issues related to the separate compilation feature found in distributed unit testing. (https://github.com/Oneflow-Inc/oneflow/pull/9749)
Fixed memory handling issues in flatten algorithm implementation. (https://github.com/Oneflow-Inc/oneflow/pull/9746)
Fixed a deadlock issue in the execution flow. (https://github.com/Oneflow-Inc/oneflow/pull/9738)
Fixed errors in isinstance check for DummyModule. (https://github.com/Oneflow-Inc/oneflow/pull/10207)
Corrected behavior where default size was erroneously overridden when introducing llvm::SmallVector. (https://github.com/Oneflow-Inc/oneflow/pull/9932)
Fixed errors in calculating memory size of non-contiguous memory tensors. (https://github.com/Oneflow-Inc/oneflow/pull/9819)
Fixed issues with calling CHECK_JUST in the TensorStorage destructor function. (https://github.com/Oneflow-Inc/oneflow/pull/9752)
The backbone parts of the ResNet50 and Faster RCNN models were compiled and executed using the OneFlow compile_from_torch and PyTorch compile interfaces, to test inference performance with inputs of different shapes. The results are shown in the table below:
Model | input shape | PyTorch compile | OneFlow compile_from_torch | dynamic | test timing |
---|---|---|---|---|---|
ResNet50 | (1, 3, 512, 512) | 21.328 s | 3.205 s | False | initial compilation and execution |
ResNet50 | (2, 3, 896, 512) | 14.167 s | 1.523 s | False | continuous compilation and execution |
ResNet50 | (2, 3, 512, 896) | 13.364 s | 1.402 s | False | continuous compilation and execution |
ResNet50 | (3, 3, 896, 896) | 15.056 s | 1.539 s | False | continuous compilation and execution |
ResNet50 | (2, 3, 1024, 896) | 14.167 s | 1.500 s | False | continuous compilation and execution |
ResNet50 | (2, 3, 896, 1024) | 12.891 s | 1.494 s | False | continuous compilation and execution |
ResNet50 | (6, 3, 1024, 1024) | 14.859 s | 1.872 s | False | continuous compilation and execution |
ResNet50 | (1, 3, 512, 512) | 170.446 s | 3.143 s | True | initial compilation and execution |
ResNet50 | (2, 3, 896, 512) | 185.672 s | 0.851 s | True | continuous compilation and execution |
ResNet50 | (2, 3, 512, 896) | 0.089 s | 0.836 s | True | continuous compilation and execution |
ResNet50 | (3, 3, 896, 896) | 0.084 s | 0.980 s | True | continuous compilation and execution |
ResNet50 | (2, 3, 1024, 896) | 0.077 s | 0.942 s | True | continuous compilation and execution |
ResNet50 | (2, 3, 896, 1024) | 0.080 s | 0.931 s | True | continuous compilation and execution |
ResNet50 | (6, 3, 1024, 1024) | 0.084 s | 1.406 s | True | continuous compilation and execution |
Faster RCNN | (1, 3, 512, 512) | 18.224 s | 5.483 s | False | initial compilation and execution |
Faster RCNN | (2, 3, 896, 512) | 9.200 s | 3.011 s | False | continuous compilation and execution |
Faster RCNN | (2, 3, 512, 896) | 9.331 s | 3.025 s | False | continuous compilation and execution |
Faster RCNN | (3, 3, 896, 896) | 9.301 s | 2.854 s | False | continuous compilation and execution |
Faster RCNN | (2, 3, 1024, 896) | 9.290 s | 2.805 s | False | continuous compilation and execution |
Faster RCNN | (2, 3, 896, 1024) | 9.123 s | 2.851 s | False | continuous compilation and execution |
Faster RCNN | (6, 3, 1024, 1024) | 9.377 s | 3.180 s | False | continuous compilation and execution |
Faster RCNN | (1, 3, 512, 512) | 25.444 s | 5.430 s | True | initial compilation and execution |
Faster RCNN | (2, 3, 896, 512) | 25.381 s | 1.899 s | True | continuous compilation and execution |
Faster RCNN | (2, 3, 512, 896) | 0.116 s | 1.886 s | True | continuous compilation and execution |
Faster RCNN | (3, 3, 896, 896) | 1.982 s | 1.793 s | True | continuous compilation and execution |
Faster RCNN | (2, 3, 1024, 896) | 0.114 s | 1.803 s | True | continuous compilation and execution |
Faster RCNN | (2, 3, 896, 1024) | 0.111 s | 1.778 s | True | continuous compilation and execution |
Faster RCNN | (6, 3, 1024, 1024) | 0.143 s | 2.110 s | True | continuous compilation and execution |
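The latencies in the table above are per-shape wall-clock timings of a compile-plus-run call. A minimal timing harness sketch (illustrative only: `dummy_model` is a stand-in, since the real measurements would call `compile_from_torch(model)(input)` or `torch.compile(model)(input)` on a GPU):

```python
import time

def time_call(fn, *args):
    """Return (result, elapsed seconds) for one call -- the per-shape
    latencies in the table are single-call timings like this."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stand-in for a compiled model; a real benchmark would invoke the
# compiled ResNet50 / Faster RCNN backbone with a tensor of this shape.
def dummy_model(shape):
    return sum(shape)

_, first = time_call(dummy_model, (1, 3, 512, 512))   # "initial compilation and execution"
_, later = time_call(dummy_model, (2, 3, 896, 512))   # "continuous compilation and execution"
print(first >= 0.0 and later >= 0.0)
```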
Using the OneFlow compile_from_torch and PyTorch compile interfaces, the unet section of the Stable Diffusion model was compiled and executed to test the inference performance with outputs of different shapes. The results are presented in the table below:
Model | Output shape | PyTorch compile | OneFlow compile_from_torch | dynamic | test timing |
---|---|---|---|---|---|
Stable Diffusion | (2, 512, 512) | 103.701 s | 63.670 s | False | initial compilation and execution |
Stable Diffusion | (1, 512, 768) | 95.137 s | 53.864 s | False | continuous compilation and execution |
Stable Diffusion | (2, 768, 512) | 90.259 s | 55.271 s | False | continuous compilation and execution |
Stable Diffusion | (1, 768, 768) | 90.196 s | 51.590 s | False | continuous compilation and execution |
Stable Diffusion | (2, 512, 512) | 275.660 s | 57.117 s | True | initial compilation and execution |
Stable Diffusion | (1, 512, 768) | 345.774 s | 43.752 s | True | continuous compilation and execution |
Stable Diffusion | (2, 768, 512) | 349.835 s | 47.653 s | True | continuous compilation and execution |
Stable Diffusion | (1, 768, 768) | 7.224 s | 45.720 s | True | continuous compilation and execution |
Stable Diffusion | (2, 512, 512) | 4.088 s | 2.831 s | False | subsequent execution |
Stable Diffusion | (1, 512, 768) | 3.296 s | 2.325 s | False | subsequent execution |
Stable Diffusion | (2, 768, 512) | 5.594 s | 5.157 s | False | subsequent execution |
Stable Diffusion | (1, 768, 768) | 4.713 s | 3.557 s | False | subsequent execution |
Stable Diffusion | (2, 512, 512) | 4.448 s | 2.801 s | True | subsequent execution |
Stable Diffusion | (1, 512, 768) | 3.201 s | 2.314 s | True | subsequent execution |
Stable Diffusion | (2, 768, 512) | 6.093 s | 4.166 s | True | subsequent execution |
Stable Diffusion | (1, 768, 768) | 4.920 s | 3.557 s | True | subsequent execution |
Conclusion: The OneFlow compile_from_torch interface generally has shorter compilation times than the PyTorch compile interface. Additionally, thanks to OneFlow's extensive operator optimizations, it delivers better execution performance on the Stable Diffusion model.
Note: The tests were conducted on an RTX 3090 GPU with PyTorch v2.1.2 and CUDA 12.2.
Model | GPU model | number of GPUs | macro batch | PyTorch performance(iter/s) | OneFlow performance(iter/s) | speedup ratio |
---|---|---|---|---|---|---|
ResNet50 | 3090 | 1 | 1 | 31.37 | 38.81 | 23.72% |
ResNet50 | 3090 | 1 | 2 | 32.06 | 48.45 | 51.12% |
ResNet50 | 3090 | 2 | 1 | 31.10 | 33.46 | 7.59% |
ResNet50 | 3090 | 2 | 2 | 31.76 | 34.83 | 9.67% |
ResNet50 | A100 | 1 | 1 | 24.60 | 46.64 | 89.59% |
ResNet50 | A100 | 1 | 2 | 25.06 | 49.88 | 99.04% |
ResNet50 | A100 | 2 | 1 | 25.28 | 39.18 | 54.98% |
ResNet50 | A100 | 2 | 2 | 24.09 | 32.84 | 36.32% |
Bert | 3090 | 1 | 1 | 8.93 | 10.41 | 16.57% |
Bert | 3090 | 1 | 2 | 13.11 | 14.31 | 9.15% |
Bert | 3090 | 2 | 1 | 6.94 | 8.27 | 19.16% |
Bert | 3090 | 2 | 2 | 12.19 | 15.58 | 27.81% |
Bert | A100 | 1 | 1 | 10.45 | 12.72 | 21.72% |
Bert | A100 | 1 | 2 | 20.24 | 21.57 | 6.57% |
Bert | A100 | 2 | 1 | 12.63 | 16.09 | 27.39% |
Bert | A100 | 2 | 2 | 24.86 | 29.84 | 20.03% |
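The speedup-ratio column can be reproduced from the two throughput columns as OneFlow iter/s divided by PyTorch iter/s, minus one. A quick consistency check using a few rows from the table above:

```python
# (model, GPU, PyTorch iter/s, OneFlow iter/s, reported speedup %), from the table
rows = [
    ("ResNet50", "3090", 31.37, 38.81, 23.72),
    ("ResNet50", "A100", 25.06, 49.88, 99.04),
    ("Bert", "A100", 24.86, 29.84, 20.03),
]
for model, gpu, pt, of, expected in rows:
    speedup = (of / pt - 1.0) * 100.0  # percent improvement over PyTorch
    assert abs(speedup - expected) < 0.01, (model, gpu, speedup)
print("speedup ratios consistent")
```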
Conclusion: Compared to PyTorch Eager, OneFlow Eager shows significant performance advantages in small-batch scenarios on both the ResNet50 and BERT models.
Note: The tests were conducted with PyTorch v2.1.0 and CUDA 12.1.
compile_from_torch
This interface converts a PyTorch Module instance into a OneFlow Module instance while sharing parameter memory, and supports direct Eager execution or conversion into a static graph nn.Graph with further MLIR compilation acceleration. (https://github.com/Oneflow-Inc/oneflow/pull/10404, https://github.com/Oneflow-Inc/oneflow/pull/10408, https://github.com/Oneflow-Inc/oneflow/pull/9984, https://github.com/Oneflow-Inc/oneflow/pull/9754)
Interface signature and parameters:
compile_from_torch(torch_module: torch.nn.Module, *, use_graph=True, options={})
* torch_module: the Torch Module instance to be converted.
* use_graph: whether to convert into a static graph nn.Graph and accelerate it with MLIR compilation; defaults to True.
* options:
* size: when static graph nn.Graph is used, a hash is computed from the input shape and the corresponding graph is cached; size is the maximum capacity of the graph cache, and graphs beyond it are evicted by an LRU policy. Defaults to 9.
* dynamic: for dynamic-shape inputs, the graph is fully compiled on the first input; for subsequent inputs of different shapes, a shared graph is used to accelerate compilation when dynamic is True, while every new shape is recompiled from scratch when dynamic is False. Defaults to True.
* debug: debug mode and log level. -1 disables debug mode; 0 prints warnings and static-graph build information; 1 additionally prints build information for each submodule; 2 additionally prints progress for each operator; 3 prints more detailed operator information. Defaults to -1.
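The size/LRU behavior described above can be sketched as follows. This is an illustrative stand-in, not OneFlow's actual cache implementation; `GraphCache` and `compile_fn` are hypothetical names:

```python
from collections import OrderedDict

class GraphCache:
    """LRU cache keyed by input shape, mirroring the `size` option:
    beyond `size` entries, the least recently used graph is evicted."""
    def __init__(self, size=9):
        self.size = size
        self.cache = OrderedDict()

    def get_graph(self, shape, compile_fn):
        key = hash(shape)
        if key in self.cache:
            self.cache.move_to_end(key)  # cache hit: mark as recently used
        else:
            self.cache[key] = compile_fn(shape)  # cache miss: compile
            if len(self.cache) > self.size:
                self.cache.popitem(last=False)   # evict the LRU entry
        return self.cache[key]

cache = GraphCache(size=2)
compiled = []
compile_fn = lambda shape: compiled.append(shape) or shape  # records each compilation
cache.get_graph((1, 3, 512, 512), compile_fn)
cache.get_graph((2, 3, 896, 512), compile_fn)
cache.get_graph((1, 3, 512, 512), compile_fn)  # hit, no recompile
cache.get_graph((3, 3, 896, 896), compile_fn)  # evicts (2, 3, 896, 512)
print(len(compiled))  # 3 compilations for 4 calls
```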
Example:
import torch
from torchvision import models
import oneflow
from oneflow.framework.infer_compiler import compile_from_torch

DEVICE = torch.device("cuda")
WEIGHT = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=WEIGHT).to(DEVICE)
compile_model = compile_from_torch(model, options={"dynamic": True})
The static graph distributed physical execution plan supports separate compilation: each process independently compiles the execution plan it needs, so compilation time no longer grows linearly with the number of GPUs. Separate compilation supports 3D hybrid parallelism (data parallelism + model parallelism + pipeline parallelism) and can be used together with LiBai, the open-source toolbox for large-scale model training. Enable it with: export ONEFLOW_ENABLE_LAZY_SEPARATE_COMPILE=1
. (https://github.com/Oneflow-Inc/oneflow/pull/9920, https://github.com/Oneflow-Inc/oneflow/pull/10140, https://github.com/Oneflow-Inc/oneflow/pull/10141, https://github.com/Oneflow-Inc/oneflow/pull/10124, https://github.com/Oneflow-Inc/oneflow/pull/10102)
Below are test results with LiBai on the GPT2 model, on 128 A100-PCIE-40GB GPUs:
Parallelism | Separate compilation | Execution plan compile time |
---|---|---|
Data parallelism (DP128 MP1 PP1) | No | more than 20 minutes |
Data parallelism (DP128 MP1 PP1) | Yes | 108.21 s |
3D parallelism (DP4 MP4 PP8) | No | 445.16 s |
3D parallelism (DP4 MP4 PP8) | Yes | 82.88 s |
Added a series of functional automatic differentiation interfaces, including jvp, vjp, hvp, vhp, jacobian, and hessian. (https://github.com/Oneflow-Inc/oneflow/pull/10412, https://github.com/Oneflow-Inc/oneflow/pull/10428)
Example:
import oneflow as flow

# jacobian example
def exp_reducer(x):
    return x.exp().sum(dim=1)

input = flow.rand(2, 2)
jac_rslt = flow.autograd.functional.jacobian(exp_reducer, input)

# vhp example
def pow_reducer(x):
    return x.pow(3).sum()

input = flow.rand(2, 2)
v = flow.ones(2, 2)
vhp_rslt = flow.autograd.functional.vhp(pow_reducer, input, v)
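As a sanity check on the jacobian example: for exp_reducer, out_i = sum_k exp(x[i][k]), so d out_i / d x[j][k] equals exp(x[j][k]) when i == j and 0 otherwise. A plain-Python finite-difference sketch (independent of OneFlow, operating on nested lists) confirming this for a fixed 2x2 input:

```python
import math

def exp_reducer(x):  # same function as above, on nested lists
    return [sum(math.exp(v) for v in row) for row in x]

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian: jac[i][j][k] = d f(x)[i] / d x[j][k]."""
    base = f(x)
    jac = [[[0.0] * len(x[0]) for _ in x] for _ in base]
    for j, row in enumerate(x):
        for k in range(len(row)):
            bumped = [r[:] for r in x]  # copy, then perturb one entry
            bumped[j][k] += eps
            out = f(bumped)
            for i in range(len(base)):
                jac[i][j][k] = (out[i] - base[i]) / eps
    return jac

x = [[0.1, 0.2], [0.3, 0.4]]
jac = jacobian_fd(exp_reducer, x)
# Analytically, jac[i][j][k] == exp(x[j][k]) when i == j, else 0.
assert abs(jac[0][0][1] - math.exp(0.2)) < 1e-4
assert abs(jac[1][0][1]) < 1e-4
print("finite-difference Jacobian matches the analytic form")
```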
Added the Insight module, which visualizes kernel invocations, execution time, speed, and other information within the instrumented intervals. (https://github.com/Oneflow-Inc/oneflow/pull/10370)
For usage details, see: https://github.com/Oneflow-Inc/oneflow/tree/master/python/oneflow/utils/insight#usage
LiBai, the open-source toolbox for large-scale model training, has been upgraded to v0.3.0, with native support for fine-tuning and distributed inference of the large language models Llama2 and ChatGLM2. It supports full finetune, adapter finetune, and lora finetune, and lm-eval-harness can be used for language model evaluation and validation.
Distributed training and inference support for ChatGLM and Llama2 is as follows:
Example:
# full finetune
bash tools/train.sh projects/Llama/train_net.py projects/Llama/configs/llama_sft.py 8
# adapter finetune
bash tools/train.sh projects/Llama/adapter/train_net.py projects/Llama/adapter/adapter_sft.py 8
# inference
bash tools/infer.sh projects/Llama/pipeline.py 8
# eval
python projects/Llama/utils/eval_adapter.py
Made a series of optimizations and refactorings to the Eager runtime, mainly including: unification of system memory pools, integration with CUDA native interfaces, optimized instruction scheduling, an instruction fusion mechanism, faster Autograd graph construction, an optimized Op inference process, and decoupling of Instruction and Stream.
The behavior of the Eager runtime can be configured via the following environment variables:
Environment variable | Meaning | Default |
---|---|---|
ONEFLOW_VM_COMPUTE_ON_WORKER_THREAD | whether computation is performed on worker threads | true |
ONEFLOW_VM_MULTI_THREAD | whether Eager operations are executed cooperatively by multiple threads | true |
ONEFLOW_VM_ENABLE_STREAM_WAIT | whether dependencies between multiple streams use the stream_wait mechanism | true |
ONEFLOW_VM_ENABLE_SCHEDULE_YIELD | whether to use the yield mechanism to reduce busy waiting in the scheduler thread | true |
ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE | whether to cache operator output metadata during computation | true |
ONEFLOW_VM_WORKER_THREAD_LIMIT | number of worker threads | 16 |
ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE | maximum size of fused vm instructions | 10 |
ONEFLOW_VM_BLOCKING_DEBUG_INSTRUCTIONS_DISPLAY_LIMIT | number of unprocessed instructions printed when vm execution times out | 1000 |
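An illustrative sketch (not OneFlow internals) of how boolean and integer switches like those in the table are typically read, with the table's defaults applied when a variable is unset:

```python
import os

def env_bool(name, default):
    """Parse a boolean env var; unset falls back to the given default."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.lower() in ("1", "true", "yes", "on")

def env_int(name, default):
    """Parse an integer env var with a default."""
    val = os.environ.get(name)
    return int(val) if val is not None else default

os.environ["ONEFLOW_VM_WORKER_THREAD_LIMIT"] = "8"   # override one default
print(env_bool("ONEFLOW_VM_MULTI_THREAD", True))     # unset -> default True
print(env_int("ONEFLOW_VM_WORKER_THREAD_LIMIT", 16)) # overridden -> 8
```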
OneFlow Serving has been upgraded: in addition to the existing OneFlow Cpp backend, it now supports the OneFlow Python backend and the OneFlow Lite backend.
For usage, see: https://github.com/Oneflow-Inc/serving/blob/main/README.md
Published by jackalcooper almost 2 years ago
OneFlow v0.9.0 came out, welcome to install the new version for a better experience.
This update contains 640 commits and the following highlights:
With the addition of 86 new API interfaces and operators aligned with PyTorch, and fixes for 104 operator-compatibility bugs, OneFlow v0.9.0 provides better PyTorch API and model compatibility. In v0.9.0, users can migrate more PyTorch models to OneFlow with one click and gain faster performance.
Allowing one-click migration of Stable Diffusion, GLM, YOLOv5, etc. to OneFlow.
More convenient model migration: oneflow.load supports directly loading models saved with torch.save.
With the newly added oneflow.mock_torch module and mock method, OneFlow can migrate complex PyTorch models containing multiple scripts with one click, without changing the original PyTorch scripts.
Global Tensor has added a series of interfaces and methods that are convenient for distributed programming, and fixed known related bugs.
The Graph released a new feature of automatic parallelism (version 1), which supports automatic search for the fastest SBP with a specified Placement. When writing distributed models with Global Tensor, users do not need to consider parallelism.
The Graph adds a series of optimizations related to memory, execution speed, pipeline masking, and compilation speed to improve performance and reduces memory overhead.
The Graph provides a series of functions to aid debugging, including analyzing memory logs, displaying the progress during the compilation stage, and the computation graph.
OneFlow IR provides more compilation optimization functions.
OneFlow's error messages are more user-friendly: the erroneous content is highlighted and unnecessary internal details are simplified, so you can see the location and type of an error at a glance.
A series of operator optimizations and system optimizations have been added, including Eager instruction scheduling, high-performance CUDA kernel, opening up of multiple memory pools, etc.
To solve possible name conflicts between Graph.Block.config and user-defined module attributes such as module.config, OneFlow redesigned the abstraction of the Graph proxy Module/Tensor, introducing a breaking change: (https://github.com/Oneflow-Inc/oneflow/pull/9351, https://github.com/Oneflow-Inc/oneflow/pull/9437, https://github.com/Oneflow-Inc/oneflow/pull/9607)
The attr and config attributes on Block are removed, and Block is renamed to Proxy.
Implementation plan: when added as members of nn.Graph, the original Eager Module and Tensor types are wrapped into the Proxy class, and the corresponding GraphModule and GraphTensor are generated. nn.Graph then uses the Proxy for graph composition and proxy execution; from the Proxy, both the original eager type and the graph type can be obtained. The naming follows that of torch.fx.
 | Eager primitive type | Graph type, base class GraphBlock | Proxy execution type, base class Proxy |
---|---|---|---|
Function | Provides access to the original eager type | A GraphBlock stores the information required for graph execution, such as name/scope/lazy op or tensor and the optimization switches of some sub-modules on the graph. | Proxy execution capability, using the same execution interfaces as Module and Tensor, but with changed behavior, e.g. lazy execution, and possibly rewritten ops. |
Module type | Module | GraphModule | ProxyModule contains a Module member and a GraphModule member |
Tensor type | Tensor | GraphTensor | ProxyTensor contains a Tensor member and a GraphTensor member |
import oneflow as flow
import oneflow.nn as nn
from oneflow.nn.graph import GraphModule

linear = flow.nn.Linear(3, 8, False)

class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        # The type of linear is nn.Module. When added as an attribute of nn.Graph, it is registered with nn.Graph:
        # self.linear is wrapped as a ProxyModule, and self.linear.weight as a ProxyTensor.
        # nn.Graph uses the ProxyModule to perform graph composition.
        self.linear = linear
        # A ProxyModule has two parts: the original Module and a GraphModule.
        # self.linear.to(GraphModule) gets the corresponding GraphModule, on which configuration related to
        # graph optimization can be done, e.g. setting a pipeline stage for a module to enable pipeline parallelism:
        self.linear.to(GraphModule).set_stage(id, placement)
        self.linear.to(nn.Module)           # get the corresponding original nn.Module
        self.linear.weight.to(flow.Tensor)  # get the corresponding original Tensor
Outdated interface in OneFlow v0.8.0:
import oneflow as flow
import oneflow.nn as nn

linear = flow.nn.Linear(3, 8, False)

class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = linear
        self.linear.config.set_stage(id, placement)         # set stage
        self.linear.config.activation_checkpointing = True  # set activation checkpointing
        self.linear.origin                                  # get the corresponding original nn.Module
        self.linear.weight.origin                           # get the corresponding original Tensor
New interface in OneFlow v0.9.0:
import oneflow as flow
import oneflow.nn as nn
from oneflow.nn.graph import GraphModule

linear = flow.nn.Linear(3, 8, False)

class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = linear
        self.linear.to(GraphModule).set_stage(id, placement)         # set stage
        self.linear.to(GraphModule).activation_checkpointing = True  # set activation checkpointing
        self.linear.to(nn.Module)                                    # get the corresponding original nn.Module
        self.linear.weight.to(flow.Tensor)                           # get the corresponding original Tensor
Adds the first version of the automatic parallelization feature in Graph: (https://github.com/Oneflow-Inc/oneflow/pull/8891, https://github.com/Oneflow-Inc/oneflow/pull/9172, https://github.com/Oneflow-Inc/oneflow/pull/9288)
Automatic parallelism can be enabled by configuring self.config.enable_auto_parallel(True)
in Graph. Once enabled, you no longer have to configure sbp manually; the Graph automatically searches for the optimal sbp combination.
Here is an example:
import oneflow as flow

class SubclassGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()  # MUST be called
        # auto parallelism configuration
        self.config.enable_auto_parallel(True)
        # other configurations about auto parallelism
        # ......

    def build(self):
        pass
Graph supports a straightened-algorithm optimization with memory priority, which shortens the lifetime of each Tensor by adjusting the execution order, thereby reducing peak memory usage. (https://github.com/Oneflow-Inc/oneflow/pull/9094)
With self.config.enable_straighten_algorithm("MemoryFirst")
, the memory-optimized straightened algorithm can be enabled.
The available modes are: "MemoryFirst" / "SpeedFirst" / "Disable" / "OverlapCpuGpu"
In addition, Graph adds the "OverlapCpuGpu" mode, which makes CPU and GPU kernels overlap with each other as much as possible. (https://github.com/Oneflow-Inc/oneflow/pull/9278)
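The effect of memory-first straightening can be seen on a toy schedule: peak memory depends on execution order, because the order determines how long each tensor stays alive. This is an illustrative sketch only, not OneFlow's actual algorithm:

```python
def peak_memory(order, producers):
    """producers: op -> (bytes its output occupies, set of consumer ops).
    A tensor is freed once its last consumer in `order` has run."""
    live, peak = {}, 0
    for step, op in enumerate(order):
        size, consumers = producers[op]
        last_use = max([order.index(c) for c in consumers], default=step)
        live[op] = (size, last_use)
        current = sum(s for s, end in live.values() if end >= step)
        peak = max(peak, current)
        live = {k: v for k, v in live.items() if v[1] > step}  # free dead tensors
    return peak

# a and b each produce 100 bytes, consumed by c and d respectively
producers = {
    "a": (100, {"c"}), "b": (100, {"d"}),
    "c": (10, set()), "d": (10, set()),
}
naive_order = ["a", "b", "c", "d"]    # both 100-byte tensors alive at once
straightened = ["a", "c", "b", "d"]   # a's output is freed before b runs
print(peak_memory(naive_order, producers), peak_memory(straightened, producers))
```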
Graph provides generalized basic transmission, using nccl send/recv to realize fast communication for any NdSbp (2d, 3d, ...), thus minimizing the transmission volume. (https://github.com/Oneflow-Inc/oneflow/pull/8437, https://github.com/Oneflow-Inc/oneflow/pull/8783)
With autograd.Function, Graph is allowed to use custom op (https://github.com/Oneflow-Inc/oneflow/pull/8843).
The Graph Optimizer supports configuring the learning rate for the parameters of each module/layer through param_group["lr_scale"]
. (https://github.com/Oneflow-Inc/oneflow/pull/9138)
Adds the enable_multi_tensor_update
optimization. Enabled by self.config.enable_multi_tensor_update(True)
, it reduces the overhead of updating a model with many small, fragmented parameters. (https://github.com/Oneflow-Inc/oneflow/pull/9209, https://github.com/Oneflow-Inc/oneflow/pull/9252)
Adds the enable_fused_model_update_cast
optimization. Enabled by self.config.enable_fused_model_update_cast(True)
, it speeds up training by fusing the Optimizer with the fp16 cast when AMP is on. (https://github.com/Oneflow-Inc/oneflow/pull/9209)
Graph supports non-uniform segmentation under ND-SBP. (https://github.com/Oneflow-Inc/oneflow/pull/9310)
Graph supports LazyTensor's indexing feature. (https://github.com/Oneflow-Inc/oneflow/pull/9334)
Adds the enable_compress_memory
interface. Enabled by self.config.enable_compress_memory(True)
, it attempts to compress the memory of the computation graph, iterating over GPU memory plans for up to half an hour and settling on a value close to the lower bound. (https://github.com/Oneflow-Inc/oneflow/pull/9509)
Adds oneflow.utils.global_view.global_mode
, which supports smooth migration from single-GPU code to multi-GPU code. global_mode creates a global context that can be switched on/off, sets the default placement and sbp under that context, and supports LocalTensor syntax such as Tensor.device
and Tensor.to(device)
. Source ops created in this context automatically generate GlobalTensors and populate the default placement and sbp, so the local-tensor logic in a module can be converted to global logic in a non-invasive manner.
Here is an example:
import oneflow as flow
from oneflow.utils.global_view import global_mode

P_C = flow.placement("cpu", ranks=[0, 1])
P = flow.placement("cuda", ranks=[0, 1])
B = flow.sbp.broadcast
S0 = flow.sbp.split(0)
x = flow.ones((6, 8), placement=P_C, sbp=S0)

with global_mode(True, placement=P, sbp=B):
    device = linear_dp.weight.device  # linear_dp: a data-parallel module defined elsewhere
    x = x.to(device)  # global tensor to device
    out = linear_dp(x)
    # The local tensor will be converted to global
    sample = flow.randn(out.shape, device="cpu").to(device)
Provides comprehensive memory analysis logs V2.0. (https://github.com/Oneflow-Inc/oneflow/pull/8565)
Setting the environment variable GLOG_v=3 enables the full memory analysis log in oneflow.INFO.
Adds the shape, dtype, life cycle, and allocation/release order of all tensors in each memory block (Chunk, MemBlock), which helps to quickly determine whether the tensors dominating memory usage in each block are expected.
The Checkpointing pass provides a log recording the tensors selected for Checkpointing.
Adds time_util to record the execution time of each module, actual physical memory occupied, and virtual memory occupied. (https://github.com/Oneflow-Inc/oneflow/pull/9164, https://github.com/Oneflow-Inc/oneflow/pull/9245)
Graph displays a compilation progress bar on rank 0 during graph compilation when debug(0)
is enabled together with the environment variable ONEFLOW_NNGRAPH_ENABLE_PROGRESS_BAR=1
. (https://github.com/Oneflow-Inc/oneflow/pull/9537)
The default log directory is removed (the directory is no longer created and log files are no longer written by default). Log-directory logs are generated only when ONEFLOW_DEBUG_MODE=1
is set. (https://github.com/Oneflow-Inc/oneflow/pull/9552, https://github.com/Oneflow-Inc/oneflow/pull/9575)
Adds the map_location
parameter to oneflow.load
to support specifying the placement or device of the loaded model tensors. (https://github.com/Oneflow-Inc/oneflow/pull/8666)
Adds oneflow.async.thread
, allowing users to create a new thread for asynchronous programming. (https://github.com/Oneflow-Inc/oneflow/pull/8866, https://github.com/Oneflow-Inc/oneflow/pull/9039, https://github.com/Oneflow-Inc/oneflow/pull/9270)
oneflow.save supports saving ddp Module objects directly. (https://github.com/Oneflow-Inc/oneflow/pull/8856)
Adds oneflow.utils.checkpoint to support Checkpointing optimization under eager. (https://github.com/Oneflow-Inc/oneflow/pull/9053)
With the newly added oneflow.mock_torch module and its mock method, a one-click migration to OneFlow can be achieved without changing the original import torch scripts. The benefit of this method is that all you need to do is add one line instead of modifying the imports of files one by one (https://github.com/Oneflow-Inc/oneflow/pull/9160 , https://github.com/Oneflow-Inc/oneflow/pull/9256 , https://github.com/Oneflow-Inc/oneflow/pull/9442 , https://github.com/Oneflow-Inc/oneflow/pull/9473). You can use it with the following code:
from oneflow.mock_torch import mock
mock()  # call before `import torch` so the import is redirected
import torch
# torch code
# ...
Supports mocks with scope, such as:
from oneflow.mock_torch import mock
with mock.enable():
    import torch
    # torch code
    # ...
Supports visual debugging of autograd's backward graph: when the environment variable ONEFLOW_DEBUG_MODE=1 is enabled, each backward computation writes the AutogradEngine execution graph as a dot file in the log directory. From it you can see the backward operators and their topology, which gives algorithm and R&D engineers an easy way to debug backward problems. (https://github.com/Oneflow-Inc/oneflow/pull/9412)
Published by jackalcooper over 2 years ago
OneFlow v0.8.0 came out, welcome to install the new version for a better experience.
This update contains 523 commits and the following highlights:
PyTorch-compatible APIs have been further optimized: 68 new APIs aligned with PyTorch have been added, and 84 operator and interface compatibility bugs have been fixed. More PyTorch models can now be migrated to OneFlow with one click.
All operators support Global Tensor more completely and efficiently, 28 Global Tensor-related bugs have been fixed, and 180 operator unit tests have been newly added.
Graph's advanced features have been further optimized:
In addition to the existing ZeRO-DP, the Zero Redundancy Optimizer (ZeRO) can now be combined with MP parallelism, 2D parallelism, and 3D parallelism, further reducing memory overhead.
Graph provides a new pipeline parallelism API, which not only simplifies the pipeline parallelism configuration but also improves the performance of pipeline parallelism and 3D parallelism.
Multi-dimensional debugging functionality has been added for the logic graph, light plan physical graph, memory analysis, Python stack information, and more, making Graph.debug more efficient.
Empowered by OneFlow v0.8.0 and LiBai v0.2.0, 3D parallelism training speed for GPT and BERT increases notably, and its training speed exceeds Megatron-LM under the same configuration in multiple dimensions. For more details, please click here.
OneEmbedding has been released recently. It is an extension component designed for large-scale recommendation systems, boasting high efficiency, extensibility, flexibility, and other advantages.
Multi-Device adaptation: OneFlow v0.8.0 provides a neat, efficient, and easily-extensible hardware abstraction layer called EP (Execution Provider) and defines a collection of basic computing interfaces called Primitive, allowing kernels to be re-implemented on top of the Primitive interface.
Added new debugging tool stacks: OneFlow-Profiler and AutoProf
OneFlow-Profiler is a tool designed to collect performance information during framework execution. It can record the execution time of operators and system components, the allocation of memory and DRAM, and the corresponding input and parameters of operators. The information can help developers find out the main source of overhead in framework execution and thus implement targeted optimization.
AutoProf is a framework designed to efficiently detect the alignment between OneFlow APIs and PyTorch APIs. Besides, it can automatically compare the performance results of OneFlow APIs and PyTorch APIs.
Significantly optimized the exception handling process in OneFlow API and improved the error message when APIs meet exceptions.
Significantly optimized the OneFlow API documentation: the API documentation has been restructured based on functionality. In addition to general operator APIs, oneflow.nn.graph, oneflow.embedding, oneflow.autograd, and other modules in OneFlow and their environment variables have also been explained in detail.
Outdated configuration method in OneFlow v0.7.0:
import oneflow as flow

zero_stage = 2  # the ZeRO stage chosen by the user (illustrative)

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.set_zero_redundancy_optimizer_mode("distributed_split")
        if zero_stage > 1:
            # stage 2
            flow.boxing.nccl.enable_use_compute_stream(True)
        if zero_stage > 2:
            # stage 3
            flow.boxing.nccl.disable_group_boxing_by_dst_parallel(True)

    def build(self, x):
        return self.linear(x)

graph = Graph()
New interface in OneFlow v0.8.0:
import oneflow as flow

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.enable_zero(stage=2)

    def build(self, x):
        return self.linear(x)

graph = Graph()
The parameter axis in oneflow.sbp.split() has been uniformly changed to dim to represent the slice dimension (the old name remains compatible). (https://github.com/Oneflow-Inc/oneflow/pull/8411)
v0.7.0
oneflow.sbp.split(axis=0)
v0.8.0
oneflow.sbp.split(dim=0)
In place of setting self.module_layer_0.config.stage_id = 0 (this method is no longer suggested), a new pipeline parallelism API config.set_stage has been added, which optimizes pipeline parallelism performance and avoids having to call input_tensor.to_global(placement=this_stage_placement) on all module input tensors at every stage. (https://github.com/Oneflow-Inc/oneflow/pull/8442)
v0.7.0
import oneflow as flow

B = [flow.sbp.broadcast]
P_0 = flow.placement(type="cuda", ranks=[0, 1])
P_1 = flow.placement(type="cuda", ranks=[2, 3])

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # Set each module's stage id to hint the graph to prepare the right number of buffers in the pipeline.
        self.m_stage0.config.stage_id = 0
        self.m_stage1.config.stage_id = 1
        self.config.set_gradient_accumulation_steps(4)

    def build(self, x):
        x = x.to_global(placement=P_0, sbp=B)
        y = self.m_stage0(x)
        # Move tensor between different pipeline stages.
        y = y.to_global(placement=P_1, sbp=B)
        z = self.m_stage1(y)
        return z
v0.8.0
class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # set_stage(stage_id, placement)
        # The stage id is numbered starting from 0 and increases by 1.
        # The placement is the placement of all tensors of this module.
        self.m_stage0.config.set_stage(stage_id=0, placement=P_0)
        self.m_stage1.config.set_stage(stage_id=1, placement=P_1)
        self.config.set_gradient_accumulation_steps(4)

    def build(self, x):
        # tensor.to_global(placement) is applied automatically to all input tensors of each module,
        # so there is no need to call to_global() in or around the module forward function.
        y = self.m_stage0(x)
        z = self.m_stage1(y)
        return z
Added new interfaces oneflow.env.init_rdma and oneflow.env.rdma_is_initialized to delay turning on RDMA, thus accelerating network communication across multiple devices (note: avoid using fork() after RDMA has been turned on; for example, a DataLoader with num_workers > 1 should be created before init_rdma). https://github.com/Oneflow-Inc/oneflow/pull/8415
Graph provides a new algorithm-optimization interface graph.config.enable_straighten_algorithm to optimize the execution order in the computation graph, maximizing the overlap between data transfer and computation. With this interface, data transfer speed rises by 0.6% in data parallelism mode and 6% in model parallelism mode. (https://github.com/Oneflow-Inc/oneflow/pull/8347, https://github.com/Oneflow-Inc/oneflow/pull/8483, https://github.com/Oneflow-Inc/oneflow/pull/8495 )
Optimized the implementation of clip grad in Graph to support clip_grad_max_norm > 1.0 and provided a configurable clip_grad_norm_type, which previously could only be set to 2 but now can be set to +/- inf, +/- 1, +/- 2, +/- 3, and larger p-norm values. See the reference here (https://github.com/Oneflow-Inc/oneflow/pull/7548)
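For intuition, the total-p-norm computation behind these options can be sketched with NumPy (an illustrative sketch only, not OneFlow's implementation; `total_norm` and `clip_grad_norm_` are hypothetical names):

```python
import numpy as np

def total_norm(grads, p=2.0):
    """Total p-norm over a list of gradient arrays (illustrative)."""
    if p == float("inf"):
        return max(np.abs(g).max() for g in grads)
    return sum(np.sum(np.abs(g) ** p) for g in grads) ** (1.0 / p)

def clip_grad_norm_(grads, max_norm, p=2.0):
    """Scale grads in place so their total p-norm is at most max_norm."""
    norm = total_norm(grads, p)
    if norm > max_norm:
        for g in grads:
            g *= max_norm / norm
    return norm
```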
Global tensors in Graph support the tensor.set_item operation for invariable ops, for example mask[:, :len_keep] = 0 (https://github.com/Oneflow-Inc/oneflow/pull/7751)
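The slice-assignment semantics are the familiar NumPy ones; the example above behaves like this local sketch:

```python
import numpy as np

len_keep = 2
mask = np.ones((2, 4))
mask[:, :len_keep] = 0  # zero out the first len_keep columns of every row
```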
Graph exports build_graph and compile_and_init_runtime interfaces, allowing user-defined passes to be compiled after building the graph, thus rewriting and optimizing the graph. The two interfaces also allow Graph to restore an external graph (job). (https://github.com/Oneflow-Inc/oneflow/pull/8168)
Added the RegisterJobPass interface to support rewriting the graph with self-defined external job passes. (https://github.com/Oneflow-Inc/oneflow/pull/8370)
oneflow.boxing.nccl.enable_use_compute_stream(True) gained optimized support for NCCL logical kernels:
Added a noncontiguous ReduceScatter kernel to support the conversion P -> S(i), (i > 0) (https://github.com/Oneflow-Inc/oneflow/pull/8361)
Supported the conversion B -> S (https://github.com/Oneflow-Inc/oneflow/pull/8355)
Enabled nccl send/recv primitives to support special SBP conversions (https://github.com/Oneflow-Inc/oneflow/pull/8318)
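In SBP terms, the P -> S(i) conversion is a ReduceScatter: every rank holds a partial sum, and afterwards each rank holds one slice of the reduced result. A minimal NumPy simulation of the semantics (illustrative only; the real kernel uses NCCL and runs across processes):

```python
import numpy as np

def reduce_scatter(partials, split_dim=0):
    """Simulate P -> S(split_dim): sum the per-rank partial tensors (reduce),
    then hand rank r the r-th slice along split_dim (scatter)."""
    total = np.sum(partials, axis=0)                       # reduce
    return np.split(total, len(partials), axis=split_dim)  # scatter
```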
Added the efficient fused kernel oneflow.nn.FusedMLP, which is controlled by export ONEFLOW_FUNCTOR_DISABLE_FUSED_MLP=0 (https://github.com/Oneflow-Inc/oneflow/pull/7391, https://github.com/Oneflow-Inc/oneflow/pull/8165, https://github.com/Oneflow-Inc/oneflow/pull/8217, https://github.com/Oneflow-Inc/oneflow/pull/8413)
Graph.debug offers a new parameter max_stack_depth (default 2) to denote the maximal depth of the Python stack where each op in Graph was created, making it convenient to locate the Python context of each op. (https://github.com/Oneflow-Inc/oneflow/pull/8028)
Apart from printing the input/output/variable info of modules in Graph, it also newly supports printing the info of operators constructed in module forward. (https://github.com/Oneflow-Inc/oneflow/pull/8135)
Enabled export ONEFLOW_DEBUG_MODE=true and export GLOG_v=3 to print the full memory log, which contains multi-level MemBlock info on each device (Total Memory -> Chunk -> MemBlock), Blocks with exclusive memory, Eager Variables, and other information. Besides, a lifecycle label was added in Regst to analyze each tensor's memory lifecycle.
LightPlan provides a more simplified way to display the Actor Graph, cutting down the cost of debugging based on Plan. When ONEFLOW_DEBUG_MODE=true, a series of light plan files corresponding to each rank of a Graph are generated under the log/local_rank_0/machine/ directory, containing the simplified actor sub-graph of each rank; the filename is GraphName_rank_i_light_plan. (https://github.com/Oneflow-Inc/oneflow/pull/8396)
The print(graph) method can display the logic graph by Module, making debugging during graph construction more efficient. (https://github.com/Oneflow-Inc/oneflow/pull/8131)
Supported passing extra parameters when Optimizer ParamGroup is being built, meeting other special operation demands for LrScheduler. (https://github.com/Oneflow-Inc/oneflow/pull/7753)
param_groups = [{"params": model.parameters(), "excess_param": ...}]
optim = flow.optim.Adam(param_groups, lr=0.1)
Added the oneflow.cuda.current_device interface to return the device index of the current rank (https://github.com/Oneflow-Inc/oneflow/pull/7856)
Added the oneflow.utils.from_torch interface to convert a PyTorch Tensor into a OneFlow Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7851)
Added the oneflow.utils.to_torch interface to convert a OneFlow Tensor into a PyTorch Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7851)
Added the oneflow.cuda.empty_cache interface to manually release memory (https://github.com/Oneflow-Inc/oneflow/pull/8482)
Added the oneflow.roc_auc_score interface on CPU, which is equivalent to sklearn.metrics.roc_auc_score (https://github.com/Oneflow-Inc/oneflow/pull/7951)
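The metric itself can be reproduced with a small rank-based NumPy computation (an illustrative sketch of the math, not OneFlow's kernel; tie handling is omitted for brevity):

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: the probability that a random
    positive example is scored higher than a random negative one."""
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # 1-based ranks of the scores
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```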
Provided the Tensor.contiguous_ interface as the inplace version of the contiguous operation (https://github.com/Oneflow-Inc/oneflow/pull/8275)
Added the Tensor.local_to_global and Tensor.global_to_global interfaces to separately implement different default check-meta operations (https://github.com/Oneflow-Inc/oneflow/pull/8027)
Global Tensor's Slice/SliceUpdate supported all nd_sbp inputs, and SliceUpdate fully supported the inplace operation and backpropagation (https://github.com/Oneflow-Inc/oneflow/pull/8313, https://github.com/Oneflow-Inc/oneflow/pull/8337, https://github.com/Oneflow-Inc/oneflow/pull/8344, https://github.com/Oneflow-Inc/oneflow/pull/8416)
Eager Global Tensor supported balanced-splitter nd-sbp eager boxing (https://github.com/Oneflow-Inc/oneflow/pull/7768)
Supported executing Eager Slice Boxing on arbitrary devices, including non-CPU devices and non-CUDA-capable devices (https://github.com/Oneflow-Inc/oneflow/pull/8180)
For better recommendations, modern recommendation systems always rely on huge Embedding tables. Besides, frequent iterations of user data require model training to be fast enough.
OneEmbedding is a component designed for large-scale recommendation systems, and it's efficient, extensible, and highly flexible. The following are its advantages:
Hierarchical storage and dynamic capacity expansion: users can expand the capacity of the Embedding at much lower cost.
Mixed parallelism strategy: it supports easily extending the model to train it on multi-machine multi-GPU.
Embedding quantization for better communication: in the parallel scenario, communication data can be quantized to reduce the communication amount, thus accelerating the training.
Efficient data pipeline: the model parts that have no data dependency can be executed in advance, thus overlapping with other operations in time.
Automatic mixed precision training: data can be computed in FP16 to reduce the occupied memory, thus accelerating the training speed and ensuring high model convergence precision.
A collection of efficient CUDA ops for common operations in recommendation systems is available.
Flexible model building is supported.
See OneEmbedding API documentation from here.
A collection of new functionalities and interfaces that are compatible with PyTorch 1.10.0 have been added.
Added the Tensor.pin_memory functionality, which supports placing a tensor in pinned memory when it is created. (https://github.com/Oneflow-Inc/oneflow/pull/8073)
Supported passing the pin_memory parameter when a tensor is being created. (https://github.com/Oneflow-Inc/oneflow/pull/8176)
DataLoader supported pin_memory (https://github.com/Oneflow-Inc/oneflow/pull/8214)
Added the Tensor.is_pinned attribute (https://github.com/Oneflow-Inc/oneflow/pull/8447)
Added the ~Tensor (invert) method to perform a logical NOT on each element of a tensor with dtype bool. (https://github.com/Oneflow-Inc/oneflow/pull/7899)
Added the Tensor.log2 method to compute log2 of each element. (https://github.com/Oneflow-Inc/oneflow/pull/7906)
Added the Tensor.new_zeros method to generate a new zero-filled tensor. (https://github.com/Oneflow-Inc/oneflow/pull/7937)
Added the oneflow.as_tensor interface to convert the input data into a tensor that shares its data. (https://github.com/Oneflow-Inc/oneflow/pull/7855)
Added the Tensor.__array__ method, so that np.array can take a oneflow tensor as input to construct an np.ndarray object. (https://github.com/Oneflow-Inc/oneflow/pull/7970)
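The mechanism behind this is NumPy's __array__ protocol: np.array consults the object's __array__ method. A toy illustration with a hypothetical class (not OneFlow's tensor):

```python
import numpy as np

class ToyTensor:
    """Any object with __array__ can be consumed by np.array/np.asarray."""
    def __init__(self, data):
        self._data = list(data)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this to obtain an ndarray view/copy of the object.
        arr = np.array(self._data)
        return arr.astype(dtype) if dtype is not None else arr

t = ToyTensor([1, 2, 3])
a = np.array(t)  # dispatches to t.__array__() under the hood
```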
Added the Tensor.new_tensor method to copy the input data into a new tensor. (https://github.com/Oneflow-Inc/oneflow/pull/7973)
Added the Tensor.half method, which is equivalent to tensor.to(oneflow.float16). (https://github.com/Oneflow-Inc/oneflow/pull/7971)
Added the Tensor.byte method to generate a new uint8 tensor; tensor.byte() is equivalent to tensor.to(oneflow.uint8). (https://github.com/Oneflow-Inc/oneflow/pull/8053)
Added the Tensor.view_as and Tensor.new_empty methods (https://github.com/Oneflow-Inc/oneflow/pull/8077)
Added the Tensor.type method to implement the corresponding casts and add the oneflow(.cuda).{Byte, Char, Short, Int, Long, Half, Float, Double}Tensor objects (https://github.com/Oneflow-Inc/oneflow/pull/8129)
Added the Tensor.dot method to compute the dot product of two 1D tensors; it is equivalent to oneflow.dot. (https://github.com/Oneflow-Inc/oneflow/pull/8520)
Added the oneflow.nn.init.orthogonal_ interface to initialize tensors (https://github.com/Oneflow-Inc/oneflow/pull/8009)
Added the oneflow.nn.Softshrink op (https://github.com/Oneflow-Inc/oneflow/pull/7826)
Added the oneflow.nn.Threshold op (https://github.com/Oneflow-Inc/oneflow/pull/7875)
Added the oneflow.nn.Hardshrink activation function (https://github.com/Oneflow-Inc/oneflow/pull/7887)
Added the oneflow.isnan and oneflow.isinf interfaces to decide whether each element of a tensor is nan or inf (https://github.com/Oneflow-Inc/oneflow/pull/7943)
The oneflow.nn.functional.* interfaces support passing numpy scalar parameters (https://github.com/Oneflow-Inc/oneflow/pull/7935)
Added the oneflow.nn.functional.cosine_similarity op to calculate the cosine similarity of two tensors (https://github.com/Oneflow-Inc/oneflow/pull/8119)
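The operation it computes can be sketched in NumPy (an illustrative sketch; the function name and eps handling here are just for the sketch):

```python
import numpy as np

def cosine_similarity(x1, x2, dim=-1, eps=1e-8):
    # dot(x1, x2) / (max(||x1||, eps) * max(||x2||, eps)) along `dim`
    dot = (x1 * x2).sum(axis=dim)
    n1 = np.maximum(np.linalg.norm(x1, axis=dim), eps)
    n2 = np.maximum(np.linalg.norm(x2, axis=dim), eps)
    return dot / (n1 * n2)
```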
Added the oneflow.nn.functional.conv_transpose1d, oneflow.nn.functional.conv_transpose2d, and oneflow.nn.functional.conv_transpose3d ops (https://github.com/Oneflow-Inc/oneflow/pull/7991)
Added the oneflow.unbind interface to return a tuple of all slices along a given dimension (https://github.com/Oneflow-Inc/oneflow/pull/7730)
Added the oneflow.swapdims interface to swap two specified dimensions; oneflow.swapdims is equivalent to NumPy's swapaxes. (https://github.com/Oneflow-Inc/oneflow/pull/7659)
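Its semantics match NumPy's swapaxes, shown here with NumPy so the expected shape is easy to check:

```python
import numpy as np

x = np.zeros((2, 3, 4))
y = np.swapaxes(x, 0, 2)  # what oneflow.swapdims(x, 0, 2) computes
```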
Added the oneflow.addcmul op to execute the element-wise composite function out = input + value × tensor1 × tensor2 (https://github.com/Oneflow-Inc/oneflow/pull/7282)
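The element-wise formula can be checked with a NumPy sketch (a hypothetical helper, not OneFlow's kernel):

```python
import numpy as np

def addcmul(input, tensor1, tensor2, value=1.0):
    # out = input + value * tensor1 * tensor2, element-wise
    return input + value * tensor1 * tensor2
```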
Added the oneflow.searchsorted op (https://github.com/Oneflow-Inc/oneflow/pull/7949)
Added the oneflow.mm op (https://github.com/Oneflow-Inc/oneflow/pull/8440)
Added the oneflow.tensordot interface and offered a collection of equivalent transformation cases (https://github.com/Oneflow-Inc/oneflow/pull/7968)
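Two typical equivalent transformations, shown with NumPy's tensordot (whose semantics oneflow.tensordot mirrors):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((3, 4))
b = rng.random((4, 5))
mm = np.tensordot(a, b, axes=1)     # contract one axis: plain matrix multiplication
inner = np.tensordot(a, a, axes=2)  # contract both axes: Frobenius inner product
```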
Added the oneflow.repeat_interleave op to repeat the elements of a tensor; this op is equivalent to numpy.repeat (https://github.com/Oneflow-Inc/oneflow/pull/8324)
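Its behavior matches numpy.repeat, e.g.:

```python
import numpy as np

x = np.array([1, 2, 3])
r = np.repeat(x, 2)  # what oneflow.repeat_interleave(x, 2) computes
```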
Added the oneflow.amax and Tensor.amax methods (https://github.com/Oneflow-Inc/oneflow/pull/7996)
Added the oneflow.median and Tensor.median methods (https://github.com/Oneflow-Inc/oneflow/pull/8069)
Added the oneflow.normal method and fixed the Tensor.normal method (https://github.com/Oneflow-Inc/oneflow/pull/7956)
Added the oneflow.amin and Tensor.amin methods (https://github.com/Oneflow-Inc/oneflow/pull/8042)
Added the oneflow.mv op and Tensor.mv method (https://github.com/Oneflow-Inc/oneflow/pull/8445)
Added oneflow.cuda.manual_seed, oneflow.cuda.manual_seed_all, oneflow.seed, oneflow.manual_seed, oneflow.initial_seed, oneflow.get_rng_state, and oneflow.set_rng_state, and improved the configuration of OneFlow random seed initialization. (https://github.com/Oneflow-Inc/oneflow/pull/7957 )
Added new interfaces oneflow.set_grad_enabled and oneflow.enable_grad to enable or disable automatic gradient computation for some subgraphs. (https://github.com/Oneflow-Inc/oneflow/pull/8016)
Supported an upstream gradient dtype in the autograd backward operator that differs from that of the input. (https://github.com/Oneflow-Inc/oneflow/pull/8233, https://github.com/Oneflow-Inc/oneflow/pull/8309)
Supported backward operators that do not capture any tensor executing the backward computation multiple times. (https://github.com/Oneflow-Inc/oneflow/pull/8031)
Added oneflow.cuda.set_device and oneflow.cuda.synchronize. (https://github.com/Oneflow-Inc/oneflow/pull/8322)
Refactored the RNN modules and migrated the layer-splicing implementation from Python to C++, which greatly optimized performance. Added RNNCell-related modules and modules aligned with torch.nn.utils.rnn in functionality:
RNN, LSTM, and GRU
RNNCell, LSTMCell, GRUCell, and oneflow.nn.utils.rnn
Supported heterogeneous device types: to cope with the complexity of different hardware, OneFlow, following the dependency inversion principle in software engineering, has introduced a hardware abstraction layer called Execution Provider (EP). The hardware abstraction layer is composed of a series of interfaces abstracted from the capabilities the framework requires of hardware devices at runtime. Once the hardware abstraction layer is in place, each module uses the underlying hardware through the interfaces of the abstraction layer rather than the original hardware interfaces, so modules need not concern themselves with the specific details of the hardware. When a new hardware device is introduced, because the hardware abstraction interfaces remain unchanged, all modules can adapt to the new device without any modification. Likewise, when adapting new hardware for the framework, there is no need to understand the framework's implementation details: one only needs to implement the series of interfaces according to the contract of the hardware abstraction interface and the actual situation of the hardware device, and the hardware adaptation is complete.
Execution Provider has defined a collection of runtime interfaces: device registration interface, device management interface, queue management interface, event management interface, and memory management interface.
In addition to the runtime interfaces, the Execution Provider also defines a set of computing interfaces called Primitive, which describe the computations commonly used in a deep learning framework, thus simplifying operator development during hardware adaptation. Compared with the runtime interfaces, the interfaces provided by Primitive are looser and more flexible: all interfaces are mutually independent, and each represents a specific computing capability provided by a certain hardware device. Like the runtime interfaces, the Primitive interfaces are abstracted close to the device side, so developers can carry out adaptation work without an in-depth understanding of OneFlow's mechanisms. Developers must implement all interfaces provided by the Execution Provider when adapting the runtime interfaces, but when adapting Primitive they can adapt selectively according to the actual situation of the project.
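The split between runtime interfaces and Primitive can be sketched in Python (an illustrative sketch only; OneFlow's actual EP interfaces are C++ classes with different names and many more methods):

```python
from abc import ABC, abstractmethod

class Device(ABC):
    """Runtime side: what the framework requires of any device."""
    @abstractmethod
    def malloc(self, size): ...
    @abstractmethod
    def free(self, ptr): ...

class Primitive(ABC):
    """Compute side: one independent capability, e.g. an element-wise Add."""
    @abstractmethod
    def launch(self, *args): ...

class CpuAdd(Primitive):
    """A backend implements only the primitives it chooses to support."""
    def launch(self, a, b):
        return [x + y for x, y in zip(a, b)]
```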
Added unit tests for the basic functions of ep::primitive (https://github.com/Oneflow-Inc/oneflow/pull/8099)
Added ep::primitive::constant_pad, optimized its performance, removed the obsolete pad grad, and used pad as the inverse of pad (https://github.com/Oneflow-Inc/oneflow/pull/8152)
Used unary primitive interface instead of original implementation in Kernel (https://github.com/Oneflow-Inc/oneflow/pull/8270)
Added environment variable ONEFLOW_EP_CUDA_CUBLAS_WORKSPACE_SIZE_MB to configure cublas workspace size (https://github.com/Oneflow-Inc/oneflow/pull/8478)
Scalar logical kernel supported primitives (https://github.com/Oneflow-Inc/oneflow/pull/8531)
Used primitives to implement logical not kernel (https://github.com/Oneflow-Inc/oneflow/pull/8544)
Migrated all activation kernels to use primitive (https://github.com/Oneflow-Inc/oneflow/pull/8300)
Bias add kernel supported primitive (https://github.com/Oneflow-Inc/oneflow/pull/8512)
Decoupled OneDNN from the ep::primitive CPU device and provided the environment variable ONEFLOW_ENABLE_ONEDNN_OPTS to let OneDNN accelerate the CPU primitive interfaces (https://github.com/Oneflow-Inc/oneflow/pull/8274)
Saved the log independently for each rank to log/local_rank_{i} when launching multiple processes with the launcher. (https://github.com/Oneflow-Inc/oneflow/pull/7825)
Optimized the display of OF_PROFILER_RANGE_GUARD in nsys. (https://github.com/Oneflow-Inc/oneflow/pull/8121)
OneFlow-Profiler is designed to collect various performance-related information during the execution flow of the framework. It can calculate the execution time of the operator or system components, the allocation of memory and DRAM, and can record the input and parameter information corresponding to the operator. This information can be used by developers to analyze which part brings the most overhead and implement some targeted optimizations.
Added OneFlow-Profiler. (https://github.com/Oneflow-Inc/oneflow/pull/8047)
Profiled the information of the CUDA operator. (https://github.com/Oneflow-Inc/oneflow/pull/8195)
Profiled the bandwidth information of the operator. (https://github.com/Oneflow-Inc/oneflow/pull/8254)
Added interfaces to collect bandwidth information and optimized code implementation. (https://github.com/Oneflow-Inc/oneflow/pull/8332)
Refined Profiler. (https://github.com/Oneflow-Inc/oneflow/pull/8332)
Used Kineto and CUPTI to profile the information of CUDA operator. (https://github.com/Oneflow-Inc/oneflow/pull/8417)
AutoProf is a framework designed to test the performance of OneFlow and PyTorch operators. It can automatically test operator performance and print a comparison table under different CPU thread counts and GPUs. At present, it has been applied to the development of some existing operators and all new operators. Its effect is shown below:
Added auto speed comparison framework of operator AutoProf to automatically run op to test: (https://github.com/Oneflow-Inc/oneflow/pull/8207)
The speed of OneFlow and PyTorch.
The speed of CPU/GPU Kernel under different numbers of threads.
Total end-to-end time with CPU Kernel.
Optimized the display of AutoProf to save testing time. (https://github.com/Oneflow-Inc/oneflow/pull/8303)
Supported API tests without actual kernel execution, and the time would be end2end. (https://github.com/Oneflow-Inc/oneflow/pull/8320)
Supported AutoProf to measure kernel bandwidth. (https://github.com/Oneflow-Inc/oneflow/pull/8367)
Added a pass to remove redundant Cast ops. (https://github.com/Oneflow-Inc/oneflow/pull/7837 )
Used MLIR to implement constant folding and the fused optimization of Conv and BN. (https://github.com/Oneflow-Inc/oneflow/pull/7799)
Optimized constant folding in OneFlow C++ API. (https://github.com/Oneflow-Inc/oneflow/pull/8124)
Provided fault tolerance checking for parsed module. (https://github.com/Oneflow-Inc/oneflow/pull/8299)
Fixed a bug in the constant folding unit test. (https://github.com/Oneflow-Inc/oneflow/pull/8340)
Supported IREE. (https://github.com/Oneflow-Inc/oneflow/pull/8249)
Added oneflow_iree (python) to CI. (https://github.com/Oneflow-Inc/oneflow/pull/8431)
Removed redundant output_lbns in IR. (https://github.com/Oneflow-Inc/oneflow/pull/8409)
Provided a conversion marker for Variable -> constant. (https://github.com/Oneflow-Inc/oneflow/pull/8412)
Removed hardcoded properties in IR. (https://github.com/Oneflow-Inc/oneflow/pull/8420)
Implemented the AutoNHWC pass and provided the environment variable ONEFLOW_MLIR_PREFER_NHWC. It supports automatically converting common networks' data formats to channels-last, which yields a noticeable acceleration on NVIDIA GPUs that support FP16. (https://github.com/Oneflow-Inc/oneflow/pull/7890)
Optimized the speed and memory of GPT and BERT under 3-D parallelism:
Performance optimization: the fused_scale_mask_softmax operator supports broadcast input; optimized the kernel implementation and performance of softmax for specific column counts (1024); completed the previously incomplete GetSbp list of the fused_scale_mask_softmax backward operator. (https://github.com/Oneflow-Inc/oneflow/pull/8321)
Communication optimization: Optimized the SBP communication cost under B->S, B->B, and B->P. (https://github.com/Oneflow-Inc/oneflow/pull/8378)
Interface optimization: Optimized the inefficient edge connection problem caused by the misalignment of stage id and to_global sequence dependency when using pipeline stage. (https://github.com/Oneflow-Inc/oneflow/pull/8442)
Communication optimization: nccl_use_compute_stream supports more comprehensive sbp conversions like P -> S(i). (https://github.com/Oneflow-Inc/oneflow/pull/8361)
Communication optimization: Parallel use of RDMA communication. (https://github.com/Oneflow-Inc/oneflow/pull/8415)
Memory optimization: Eliminated the randomness of the memory reuse algorithm, so that the memory reuse result on every rank is identical when the subgraphs are the same, avoiding pathological cases. (https://github.com/Oneflow-Inc/oneflow/pull/8441)
Memory optimization: Removed the extra buffer problem of Stage 0 CPU copy under Pipeline parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/8484)
Memory optimization: Under Checkpointing and Pipeline, the input identity of the module was de-duplicated to reduce additional Checkpointing tensor, and added the block name prefix of the module to the identity. (https://github.com/Oneflow-Inc/oneflow/pull/8509)
Combination Optimization: ZeRO-DP supported using with Pipeline parallel and 3-D parallel. (https://github.com/Oneflow-Inc/oneflow/pull/8464)
Provided new environment-variable optimization switches ONEFLOW_ENABLE_MULTI_TENSOR_MODEL_UPDATE and ONEFLOW_FUSE_MODEL_UPDATE_CAST. Under AMP, they support fusing the Optimizer model-update kernel with the next round's forward cast operators. (https://github.com/Oneflow-Inc/oneflow/pull/8373)
Enabled export ONEFLOW_EAGER_LOCAL_TO_GLOBAL_BALANCED_OVERRIDE=true to accelerate Eager Global execution by skipping the synchronization of meta information across the ranks of a Global Tensor (for use when users are confident that their code execution is symmetric, SPMD). (https://github.com/Oneflow-Inc/oneflow/pull/7981)
This environment variable indicates whether the shape of the input data is the same on every rank when local to global is executed. If it is set to true, there is no need to synchronize the shapes across ranks, and the logical shape is calculated locally.
Used the Python C API instead of pybind11 to optimize the calling speed of tensor and functional APIs.
Optimized functional return types to save overhead and avoid reference copies, and fixed a bug where the inplace tensor id could be inconsistent. (https://github.com/Oneflow-Inc/oneflow/pull/7985)
Moved the tensor API from pybind11 to the C Python API, added a tensor hash function, and resolved function naming conflicts. (https://github.com/Oneflow-Inc/oneflow/pull/8258, https://github.com/Oneflow-Inc/oneflow/pull/8315, https://github.com/Oneflow-Inc/oneflow/pull/8342, https://github.com/Oneflow-Inc/oneflow/pull/8375)
Performance optimization: Let vm worker threads concentrate on computing tasks, and decoupled memory tasks from computing tasks. (https://github.com/Oneflow-Inc/oneflow/pull/7976)
Optimized the speed of operations in DataLoader, including MakeLocalTensorFromData, which is 20% faster under the swin-T dataloader. (https://github.com/Oneflow-Inc/oneflow/pull/8066)
Optimized the global sparse_softmax_cross_entropy kernel. (https://github.com/Oneflow-Inc/oneflow/pull/7298)
Optimized and sped up the CPU permute kernel with OneDNN. (https://github.com/Oneflow-Inc/oneflow/pull/7872)
Optimized and sped up the CPU softmax kernel with OneDNN. (https://github.com/Oneflow-Inc/oneflow/pull/8071 , https://github.com/Oneflow-Inc/oneflow/pull/8075)
Optimized the memory usage and speed of the backward computation of the pooling kernel. (https://github.com/Oneflow-Inc/oneflow/pull/7980)
Optimized Slice and Tensor getitem operations based on View to improve the speed of dataloader. (https://github.com/Oneflow-Inc/oneflow/pull/8148, https://github.com/Oneflow-Inc/oneflow/pull/8211, https://github.com/Oneflow-Inc/oneflow/pull/8243)
Optimized the backward composition logic of `flip` and `cumsum`, and removed some grad operators. When testing grad diffs, used random-value tests to increase test robustness. (https://github.com/Oneflow-Inc/oneflow/pull/8155)
Optimized the memory usage of the `NormalizationAddReluGrad` operator and added versions that do not require addend_diff. (https://github.com/Oneflow-Inc/oneflow/pull/8213)
Optimized and sped up `tensor.reshape` and `tensor.reshape_as` by moving them from Python to C++. (https://github.com/Oneflow-Inc/oneflow/pull/8304)
Converted `tensor.view`, `tensor.view_as`, `tensor.permute`, `tensor.transpose`, and `tensor.contiguous_` from Python implementations to C++ implementations. (https://github.com/Oneflow-Inc/oneflow/pull/8317)
Greatly optimized the performance of `index_select` and `repeat_interleave` by using gather to replace dim gather. (https://github.com/Oneflow-Inc/oneflow/pull/8360)
Optimized and removed temporary memory in cumprod cpu grad kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8369)
The `embedding` operator now supports AMP; improved its performance on the normal path and fixed a memory out-of-bounds bug in the gather CPU kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8374)
Optimized the performance of `Tensor.fill_`. (https://github.com/Oneflow-Inc/oneflow/pull/8283)
Greatly optimized the backward performance of the broadcast element-wise binary family of operators. (https://github.com/Oneflow-Inc/oneflow/pull/8339)
Added fusion operator BinaryCrossEntropyWithLogitsReduceMean. (https://github.com/Oneflow-Inc/oneflow/pull/8476)
Added high-performance matrix multiplication Fused kernel based on cublasLt. (https://github.com/Oneflow-Inc/oneflow/pull/8462, https://github.com/Oneflow-Inc/oneflow/pull/8222, https://github.com/Oneflow-Inc/oneflow/pull/8063)
Exported oneflow env to python and used python's objects to manage its lifecycle. (https://github.com/Oneflow-Inc/oneflow/pull/7792)
Used Python's reference counting to control the life cycle of Graph and constructed strict and rich destruction test cases. (https://github.com/Oneflow-Inc/oneflow/pull/7857)
Supported recycling independent threads that can no longer be reused when Graph is destructed. (https://github.com/Oneflow-Inc/oneflow/pull/7862)
Changed the basic configuration of resource from one-time static effect to real-time effect. (https://github.com/Oneflow-Inc/oneflow/pull/8444)
Consolidated the nccl_comm dynamically created by the Graph NCCL logical kernel into the runtime for initial creation to avoid the deadlock caused by the inconsistency between the creation order of each rank and the eager nccl comm creation order. (https://github.com/Oneflow-Inc/oneflow/pull/8263)
Refactor optimization: Merged `nn.graph.util.IONode` and `nn.graph.util.IONodeType` into IOArgs. (https://github.com/Oneflow-Inc/oneflow/pull/8272)
Refactor optimization: Renamed the global singleton Global object to the Singleton object. (https://github.com/Oneflow-Inc/oneflow/pull/8490)
Refactor optimization: Removed gpu_device_num (https://github.com/Oneflow-Inc/oneflow/pull/8516)
Refactor optimization: Removed outdated AvailableMemDesc concepts. (https://github.com/Oneflow-Inc/oneflow/pull/8145)
Refactor optimization: Removed outdated Model IO Kernel logic. (https://github.com/Oneflow-Inc/oneflow/pull/8151)
Refactor optimization: Replaced GpuDeviceNum with the actual number of devices to avoid coupling with specific device types. (https://github.com/Oneflow-Inc/oneflow/pull/8166)
Added a C++ interface to manually trigger allocator GC on each stream (applicable to ZeRO). (https://github.com/Oneflow-Inc/oneflow/pull/8452)
The execution of Eager VirtualMachine instruction is based on the execution of EP. (https://github.com/Oneflow-Inc/oneflow/pull/7923)
Optimized and removed all redundant `Get(Ptr)OrThrow` interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/7812)
Added a validity check for `flow.save(global_dst_rank)`. (https://github.com/Oneflow-Inc/oneflow/pull/7964)
Supported the backward function node to run multiple times if it does not capture any tensor. (https://github.com/Oneflow-Inc/oneflow/pull/8031)
Added the `ThreadLocalCached` decorator to clear caches in time and alleviate growing memory usage. (https://github.com/Oneflow-Inc/oneflow/pull/7858)
Added C++14 implementations of `std::inclusive_scan`/`std::exclusive_scan`. (https://github.com/Oneflow-Inc/oneflow/pull/8128)
Packaged the parameters required by the eager opkernel and passed them per thread to solve some thread-safety problems. (https://github.com/Oneflow-Inc/oneflow/pull/7617)
Eager Stream supports kernel computation on pinned memory. (https://github.com/Oneflow-Inc/oneflow/pull/8486)
Introduced a tool class for dim range check to replace simplified Functor's various checking logic for dimensions. (https://github.com/Oneflow-Inc/oneflow/pull/8382)
Refactoring and optimization: removed the Blob object in EagerBlobObject, which caused redundant TensorView instructions. To support ShapeView efficiently, the elem_cnt attribute has also been removed. (https://github.com/Oneflow-Inc/oneflow/pull/7895)
Refactoring and optimization: extracted the algorithm used by BinAllocator to share dynamic memory pools.
Refactoring and optimization: the `VectorAt` and `MapAt` functions now uniformly pass parameters by reference, resolving the mixed use of reference and pointer interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/8191)
Refactoring and optimization: removed the cfg application on C++. (https://github.com/Oneflow-Inc/oneflow/pull/8158)
Refactoring and optimization: removed the outdated code related to RemoteBlob in Single-Client. (https://github.com/Oneflow-Inc/oneflow/pull/8228)
Refactoring and optimization: merged duplicate logic in eager boxing ccl and nccl boxing expr. (https://github.com/Oneflow-Inc/oneflow/pull/7930)
Refactoring and optimization: removed cfg on Python and reduced the number of symbols to optimize the link speed of compilation.
Refactoring and optimization: merged `symbol::IdCache` and `symbol::Storage`. (https://github.com/Oneflow-Inc/oneflow/pull/8331)
Refactoring and optimization: introduced `llvm::SmallVector` and used `oneflow::small_vector` instead of `fixed_vector`. Besides, we have optimized the implementation and usage of Shape and Stride. (https://github.com/Oneflow-Inc/oneflow/pull/8365 , https://github.com/Oneflow-Inc/oneflow/pull/8402)
Refactoring and optimization: refactored ShapeView and Shape to eliminate duplication and inconsistencies. (https://github.com/Oneflow-Inc/oneflow/pull/8422)
Refactoring and optimization: eager VirtualMachine has decoupled InstructionType's dependency on StreamType. (https://github.com/Oneflow-Inc/oneflow/pull/7607)
Refactoring and optimization: removed the InstructionMsg class and merged all its functions and fields into the Instruction class. (https://github.com/Oneflow-Inc/oneflow/pull/7623)
Stride support:
Tensor, UserOp, and UserKernel in `user_op::` all support the stride attribute. (https://github.com/Oneflow-Inc/oneflow/pull/7829)
`cast` supports stride. (https://github.com/Oneflow-Inc/oneflow/pull/8292)
View support and optimization:
Op definitions now take a flag indicating whether non-contiguous input tensors are supported. Besides, the following non-contiguous view ops are now supported: `transpose`, `permute`, `narrow`, `expand`, `expand_as`, `split`, `chunk`, `unfold_tensor`, `movedim`, `as_strided`, `select`, `swapaxes`, `T`, `t`, `hsplit`, `vsplit`, `tensor_split`. (https://github.com/Oneflow-Inc/oneflow/pull/7813)
Tensor slice uses view operations by default. (https://github.com/Oneflow-Inc/oneflow/pull/8302)
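The mechanism behind these view ops can be illustrated with a toy strided view in plain Python (this class is illustrative, not OneFlow internals): a view shares the underlying buffer and only changes shape/stride metadata, so no data is copied.

```python
class StridedView:
    """A minimal strided 'tensor' view over a flat buffer."""

    def __init__(self, buf, shape, strides):
        self.buf, self.shape, self.strides = buf, shape, strides

    def __getitem__(self, idx):  # idx is a tuple of coordinates
        offset = sum(i * s for i, s in zip(idx, self.strides))
        return self.buf[offset]

    def transpose(self):
        # Swap metadata only; the buffer is shared with the new view,
        # which is why the result is non-contiguous rather than a copy.
        return StridedView(self.buf, self.shape[::-1], self.strides[::-1])

# A 2x3 row-major "tensor" over [0, 1, 2, 3, 4, 5]: strides are (3, 1).
t = StridedView(list(range(6)), (2, 3), (3, 1))
tt = t.transpose()                 # 3x2 view with strides (1, 3)
assert tt[(2, 1)] == t[(1, 2)] == 5
assert tt.buf is t.buf             # no copy was made
```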
Automatically generated version status (Feature Stage) for OneFlow's API. (https://github.com/Oneflow-Inc/oneflow/pull/7945)
Optimized CUDA memset to use `cudaMemsetAsync`. (https://github.com/Oneflow-Inc/oneflow/pull/7763)
`LeakyReLU` supports inplace optimization. (https://github.com/Oneflow-Inc/oneflow/pull/8060)
Added the following parameters to the `nn.Embedding` interface: `padding_idx`, `max_norm`, `norm_type`, `scale_grad_by_freq`. (https://github.com/Oneflow-Inc/oneflow/pull/8110)
Aligned PyTorch's `max_pool_1d`, `max_pool_2d`, `max_pool_3d`, `avg_pool_1d`, `avg_pool_2d`, and `avg_pool_3d`, and distinguished them from the old pooling kernels aligned with TensorFlow. (https://github.com/Oneflow-Inc/oneflow/pull/8111)
`VectorAt` supports passing in non-const references: `JUST(VectorAt(vec, 1)) = 5;`. (https://github.com/Oneflow-Inc/oneflow/pull/8013)
Reduced the uncommon kernel template specializations of layer norm. (https://github.com/Oneflow-Inc/oneflow/pull/8209)
Modified the logic of `Tensor.numpy` to avoid extra memory growth when saving models. (https://github.com/Oneflow-Inc/oneflow/pull/8449)
Tensor str supports printing nd_sbp. (https://github.com/Oneflow-Inc/oneflow/pull/8458)
Slice supports SBP inference (S->P), and the semi-automatically deduced SBP can select the same SBP as expected in the reducible nd_sbp. (https://github.com/Oneflow-Inc/oneflow/pull/8536)
When printing a non-CPU, non-CUDA tensor, it is first copied to the CPU and then printed. (https://github.com/Oneflow-Inc/oneflow/pull/8548)
Refactoring and optimization: decoupling user kernel and device tag. (https://github.com/Oneflow-Inc/oneflow/pull/8529)
Refactoring and optimization: a series of kernels (`squeeze`, `reshape_like`, `flatten`, `expand_dims`, `reshape`, `amp_white_identity`, `identity`, `identity_buffer`, `parallel_cast`, `hierarchical_parallel_cast`, `hierarchical_parallel_cast_like`) were refactored into CopyDataContentKernel. (https://github.com/Oneflow-Inc/oneflow/pull/8537)
Refactoring and optimization: removed obsolete `constant_pad1d`, `constant_pad2d`, and `constant_pad3d` kernels. (https://github.com/Oneflow-Inc/oneflow/pull/8113)
Refactoring and optimization: removed the obsolete old lazy `upsample` kernel implementation. (https://github.com/Oneflow-Inc/oneflow/pull/8188)
Refactoring and optimization: removed obsolete message in shape proto and used sequential to represent stride. (https://github.com/Oneflow-Inc/oneflow/pull/8220)
Refactoring and optimization: removed the obsolete multiply kernel, which was included in `broadcast_mul`. (https://github.com/Oneflow-Inc/oneflow/pull/8359)
Refactoring and optimization: Renamed the shape in UserOp/Kernel to shape_view interface. (https://github.com/Oneflow-Inc/oneflow/pull/8433)
Refactoring and optimization: removed oneflow gemm. (https://github.com/Oneflow-Inc/oneflow/pull/8499)
Optimized the Maybe return type of such interfaces as Scalar.As(). (https://github.com/Oneflow-Inc/oneflow/pull/8348)
Code refactoring: refactored `ep::CpuDevice`. (https://github.com/Oneflow-Inc/oneflow/pull/7911)
Code refactoring: removed hard-coded special decision for device type like "cpu", "cuda" from system code. (https://github.com/Oneflow-Inc/oneflow/pull/8201)
Removed all dnn-related interfaces from the old version of KernelUtil (Primitive will be used to replace those interfaces). (https://github.com/Oneflow-Inc/oneflow/pull/8141)
Removed all interfaces related to mathematical calculation in the old version of KernelUtil (Primitive will be used to replace those interfaces). (https://github.com/Oneflow-Inc/oneflow/pull/8157)
Removed incomplete special decision for the "cuda" device type in scope util. (https://github.com/Oneflow-Inc/oneflow/pull/8173)
Achieved delayed capture of CUDA Graph. (https://github.com/Oneflow-Inc/oneflow/pull/8474)
Code refactoring: removed cuda_event. (https://github.com/Oneflow-Inc/oneflow/pull/8493)
Code refactoring: removed useless WITH_CUDA macro. (https://github.com/Oneflow-Inc/oneflow/pull/8562)
In 0.8.0, we have completed the ability of all kernels to deal with global tensor in distributed situation, and fixed many known bugs related to sbp. The global tensor worked efficiently and correctly at the kernel level. No matter how the distributed topology structure changes, the same algorithm logic can efficiently get mathematically consistent results, which greatly reduced the trouble of verifying correctness in the complex, diverse and asymmetric distributed parallel training process.
Completed unit tests for Primitives: `log_softmax`, `softmax`, `copynd`, `Memset`, `Memcpy`, `matmul`, `batch_matmul`, `add`, `fill`, and the binary/unary primitives, etc. (https://github.com/Oneflow-Inc/oneflow/pull/8132, https://github.com/Oneflow-Inc/oneflow/pull/8139, https://github.com/Oneflow-Inc/oneflow/pull/8137, https://github.com/Oneflow-Inc/oneflow/pull/8109, https://github.com/Oneflow-Inc/oneflow/pull/8143, https://github.com/Oneflow-Inc/oneflow/pull/8108, https://github.com/Oneflow-Inc/oneflow/pull/8154, https://github.com/Oneflow-Inc/oneflow/pull/8118 , https://github.com/Oneflow-Inc/oneflow/pull/8291)
Improved exception error handling
Added `reshape` exception handling. (https://github.com/Oneflow-Inc/oneflow/pull/7847)
Improved the error message of module when the input information does not match. (https://github.com/Oneflow-Inc/oneflow/pull/7918)
Added the `MAYBE_NEED_ERROR_MSG_CHECK` environment variable to check whether the CHECK functions of Maybe contain an oneflow::Error message, prompting developers to add error messages. (https://github.com/Oneflow-Inc/oneflow/pull/7955)
Improved the exception error message of the `gather` op. (https://github.com/Oneflow-Inc/oneflow/pull/7979)
Improved the `LayerNorm` error message. (https://github.com/Oneflow-Inc/oneflow/pull/8090)
Optimized the error message when Eager and Graph encounter multiple inconsistent input placement in op. (https://github.com/Oneflow-Inc/oneflow/pull/8054)
Improved the error message checking in activation-related kernel processing logic. (https://github.com/Oneflow-Inc/oneflow/pull/8080)
Improved the error messages in `tensor.to_global` and `tensor.to_local`. (https://github.com/Oneflow-Inc/oneflow/pull/8067)
Improved the exception error message in the `dot` kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8051)
Rewrote the exception check in the `batch_matmul` kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8186)
Fixed the problem of exception error checking when Python parses arg. (https://github.com/Oneflow-Inc/oneflow/pull/8205)
Improved the exception error checking logic of all array functor. (https://github.com/Oneflow-Inc/oneflow/pull/8116)
Improved the exception error checking logic of all binary functor. (https://github.com/Oneflow-Inc/oneflow/pull/8161)
Improved the exception error reporting logic in nn grad functor. (https://github.com/Oneflow-Inc/oneflow/pull/8210)
Added error message when Graph.build is not reloaded. (https://github.com/Oneflow-Inc/oneflow/pull/8250)
Added TypeError type and device-related error message. (https://github.com/Oneflow-Inc/oneflow/pull/8057)
Improved the error message of Eager SliceBoxing. (https://github.com/Oneflow-Inc/oneflow/pull/8232)
Improved the error message of broadcast op.
Improved the error message of Eager Boxing when it is at runtime. (https://github.com/Oneflow-Inc/oneflow/pull/7926)
Improved the error message of Tensor index. (https://github.com/Oneflow-Inc/oneflow/pull/8234)
Improved the error message in nn.functor. (https://github.com/Oneflow-Inc/oneflow/pull/7910)
Added check for Physical Shape when Graph compiles exec_graph. (https://github.com/Oneflow-Inc/oneflow/pull/8002)
Added default error message for CUDA check. (https://github.com/Oneflow-Inc/oneflow/pull/8427)
Added similar error checking information to the add_n calculation. (https://github.com/Oneflow-Inc/oneflow/pull/8495)
Improved the error message of arg sort. (https://github.com/Oneflow-Inc/oneflow/pull/8513)
Improved the error message of bias add. (https://github.com/Oneflow-Inc/oneflow/pull/8524)
Improved the error message in autograd function. (https://github.com/Oneflow-Inc/oneflow/pull/8496)
Improved the error message of batch gather. (https://github.com/Oneflow-Inc/oneflow/pull/8533)
Improved the error message prompt of defense code in autograd. (https://github.com/Oneflow-Inc/oneflow/pull/8525 , https://github.com/Oneflow-Inc/oneflow/pull/8541)
Supported CUDA 11.5 and 11.6. (https://github.com/Oneflow-Inc/oneflow/pull/7852 , https://github.com/Oneflow-Inc/oneflow/pull/8423)
Fixed the version of click at 8.0.0. (https://github.com/Oneflow-Inc/oneflow/pull/7967)
Updated nccl version to 2.12.10. (https://github.com/Oneflow-Inc/oneflow/pull/7822)
Aligned with PyTorch version 1.10.0 by default. (https://github.com/Oneflow-Inc/oneflow/pull/7019)
Updated tvm oneflow frontend dependencies. (https://github.com/Oneflow-Inc/oneflow/pull/8048)
Updated the version of LLVM/MLIR to support IREE. (https://github.com/Oneflow-Inc/oneflow/pull/8068 , https://github.com/Oneflow-Inc/oneflow/pull/8461)
Fixed the version of protobuf between 3.9.2 to 4.0. (https://github.com/Oneflow-Inc/oneflow/pull/8198)
Removed the cfg tool in cmake. (https://github.com/Oneflow-Inc/oneflow/pull/8218)
Enabled the CMAKE_INTERPROCEDURAL_OPTIMIZATION option by default. (https://github.com/Oneflow-Inc/oneflow/pull/8237)
Removed the XRT part in the OneFlow source code, and the OneFlow-XRT will be used as a third-party plugin for oneflow. (https://github.com/Oneflow-Inc/oneflow/pull/8273 ,https://github.com/Oneflow-Inc/oneflow/pull/8288)
Changed Liboneflow to dynamic library. (https://github.com/Oneflow-Inc/oneflow/pull/8312)
Updated the version of clang-tidy to 14.0.4. Supports the following syntax now: NOLINT, NOLINTNEXTLINE, NOLINTBEGIN & NOLINTEND. (https://github.com/Oneflow-Inc/oneflow/pull/8306)
Removed `EXTERNAL_INCLUDE_DIRS`; now builds only with targets. (https://github.com/Oneflow-Inc/oneflow/pull/8421)
Removed obsolete linkages in cmake. (https://github.com/Oneflow-Inc/oneflow/pull/8426)
Improved the running speed and stability of CI
Supported CI to automatically upload built docs. (https://github.com/Oneflow-Inc/oneflow/pull/7894 , https://github.com/Oneflow-Inc/oneflow/pull/7917)
Added CI test for IREE. (https://github.com/Oneflow-Inc/oneflow/pull/8419)
Printed the pip package in the container used to test in order to query version information easily. (https://github.com/Oneflow-Inc/oneflow/pull/7952)
Optimized the old version of SpeedTest. (https://github.com/Oneflow-Inc/oneflow/pull/7871 https://github.com/Oneflow-Inc/oneflow/pull/7990 https://github.com/Oneflow-Inc/oneflow/pull/8035)
Optimized the memory used by AutoTest. (https://github.com/Oneflow-Inc/oneflow/pull/7988)
Adjusted the threshold of benchmark. (https://github.com/Oneflow-Inc/oneflow/pull/8043)
Adjusted the timeout threshold. (https://github.com/Oneflow-Inc/oneflow/pull/8103)
Optimized the warning output related to `__del__` in CI. (https://github.com/Oneflow-Inc/oneflow/pull/8049)
Optimized the interval of gc to improve the test speed. (https://github.com/Oneflow-Inc/oneflow/pull/8138)
Optimized the use of super Tensor in CI unit tests to avoid slow gc dragging down the running speed of CI. (https://github.com/Oneflow-Inc/oneflow/pull/8177)
Optimized the number of CI build to improve the speed of build. (https://github.com/Oneflow-Inc/oneflow/pull/8229)
Optimized CI workflow: stop all workflows when a job fails. (https://github.com/Oneflow-Inc/oneflow/pull/8255)
Increased maximum parallelism 5 -> 10. (https://github.com/Oneflow-Inc/oneflow/pull/8259)
Enforced strict CI timeout-minutes. (https://github.com/Oneflow-Inc/oneflow/pull/8266)
Supported optional multi-machine testing via the `need-test-distributed` tag. (https://github.com/Oneflow-Inc/oneflow/pull/8372)
Tried to use a distributed test cache when testing on multiple machines. (https://github.com/Oneflow-Inc/oneflow/pull/8387/files)
Optimized the test time of global test. (https://github.com/Oneflow-Inc/oneflow/pull/8468)
Optimized the execution time of test_math_ops, test_loss, test_activation, test_tensor_part1, test_tensor_part2, and other eager tests. (https://github.com/Oneflow-Inc/oneflow/pull/8494)
Optimized test_convtranspose, test_einsum, test_sqrt_square_sum in expensive eager test. (https://github.com/Oneflow-Inc/oneflow/pull/8504)
Added the test of LiBai in CI. (https://github.com/Oneflow-Inc/oneflow/pull/7537, https://github.com/Oneflow-Inc/oneflow/pull/7929)
Fixed the speed test for Swin-Transformer. (https://github.com/Oneflow-Inc/oneflow/pull/7840)
Added the benchmark test for flow-vision.(https://github.com/Oneflow-Inc/oneflow/pull/7806, https://github.com/Oneflow-Inc/oneflow/pull/8024)
Added compatibility tests for `conv_mixer`, `densenet`, `ghostnet`, `googlenet`, `inception_v3`, `mnasnet`, `rexnet`, `rexnet_lite`, `res2net`, `shufflenet_v2`, `squeezenet`, `convnext`, `crossformer`, `efficientnet`, `levit`, `mlp_mixer`, `poolformer`, `pvt`, `res_mlp`, `uniformer`, `swin_transformer`, `senet`, and other models. Fixed compatibility issues such as: the conv2d module's padding parameter not supporting strings; the parameter list of functional.layer_norm not being aligned; meshgrid not supporting list[tensor] input. Also added a `tensor.reshape_as` interface. (https://github.com/Oneflow-Inc/oneflow/pull/7942)
Fixed the bug of Swin-Transformer dataloader. (https://github.com/Oneflow-Inc/oneflow/pull/8037)
Added single-node 4-Gpus tests for models such as InsightFace in oneflow_face repository. (https://github.com/Oneflow-Inc/oneflow/pull/8130)
Fixed the bug of nccl deadlock caused by CUDA kernel asynchronous launch limit for nccl logical kernel in 3-D parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/7924)
Fixed circular import of scope and session. (https://github.com/Oneflow-Inc/oneflow/pull/7993)
Used log_softmax + nll to make the sparse_softmax_cross_entropy computation subgraph more numerically stable. (https://github.com/Oneflow-Inc/oneflow/pull/7987)
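The log_softmax + nll decomposition can be sketched in plain Python (illustrative only, not the OneFlow kernel): computing log(softmax(x)) naively overflows in exp() for large logits, while the max-shifted logsumexp form stays finite.

```python
import math

def log_softmax(logits):
    # Numerically stable: subtract the max before exponentiating, so exp()
    # never overflows and log(sum(exp(...))) stays finite.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sparse_softmax_cross_entropy(logits, label):
    # nll on top of log_softmax: negate the log-probability of the label.
    return -log_softmax(logits)[label]

# A naive log(softmax(x)) would overflow here; the decomposed form is exact.
loss = sparse_softmax_cross_entropy([1000.0, 0.0, -1000.0], 0)
assert loss < 1e-6  # label 0 dominates, so the loss is essentially 0
```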
Fixed the bug that B2P boxing misses TaskEdge lbi. (https://github.com/Oneflow-Inc/oneflow/pull/8052)
Fixed the problem that compilation fails because an eager free tensor is not in nn.Graph's job. (https://github.com/Oneflow-Inc/oneflow/pull/8114)
Fixed the possible problem of SegmentFault caused by BlobDesc. (https://github.com/Oneflow-Inc/oneflow/pull/8252)
Solved the bug of circular import in python 3.6. (https://github.com/Oneflow-Inc/oneflow/pull/8268)
Solved the problem that Graph's input and parameter/buffer tensors fail to handle non-contiguous tensors.(https://github.com/Oneflow-Inc/oneflow/pull/8281)
Solved the potential deadlock caused by inconsistent partial order execution of multiple ranks in 3-D parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/8226)
Fixed the bug that Ibverbs failed to start the environment due to incorrect mtu value in special network environment. (https://github.com/Oneflow-Inc/oneflow/pull/8451)
Solved the potential deadlock caused by the partial order execution of each rank when the subsequent subgraph of GradAcc is inserted into the NCCL logical op; at the same time, traverse the subsequent subgraph of GradAcc more comprehensively to solve the problem of missing NCCL op. (https://github.com/Oneflow-Inc/oneflow/pull/8459)
Fixed the bug that NCCL logical kernels did not support the bool type. (https://github.com/Oneflow-Inc/oneflow/pull/8455)
Fixed the bug of tensor detach and clone in Graph. (https://github.com/Oneflow-Inc/oneflow/pull/8498)
Aligned the `DataLoader.__next__` interface (https://github.com/Oneflow-Inc/oneflow/pull/7835)
Fixed backtracking failure when calculating higher-order derivatives, caused by the capturing of forward detached tensors via AutoGrad
Fixed inadequate execution of the semantics of sync by Barrier Instruction (https://github.com/Oneflow-Inc/oneflow/pull/7702)
Fixed memory leak caused by imperfect management of VM instruction count
Fixed `getitem` when the tensor device id is not on the current rank
Fixed gradient calculation errors of global norm for various placements when calling clip grad in pipeline parallelism in eager global mode (https://github.com/Oneflow-Inc/oneflow/pull/7879)
Fixed possible int32 arithmetic overflow caused by `Shape.elem_cnt` (https://github.com/Oneflow-Inc/oneflow/pull/8178)
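Why `Shape.elem_cnt` can overflow int32 is easy to demonstrate in plain Python (a simulation of 32-bit wraparound, not OneFlow code): the element count of a modestly large tensor already exceeds 2**31 - 1, so index arithmetic kept in a 32-bit integer silently wraps.

```python
INT32_MAX = 2**31 - 1

def to_int32(x):
    # Simulate two's-complement int32 wraparound.
    return (x + 2**31) % 2**32 - 2**31

shape = (4, 1024, 1024, 1024)   # 2**32 elements in total
elem_cnt = 1
for d in shape:
    elem_cnt *= d

assert elem_cnt > INT32_MAX     # does not fit in int32
assert to_int32(elem_cnt) == 0  # 32-bit arithmetic silently wraps to 0
```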
Fixed incorrect results produced by `Module.to_global` when introducing parameters (https://github.com/Oneflow-Inc/oneflow/pull/8187)
Fixed extra GPU memory usage in `flow.load` and `module.load_state_dict` (https://github.com/Oneflow-Inc/oneflow/pull/8301)
Fixed extra GPU memory usage when Optimizer loads models (https://github.com/Oneflow-Inc/oneflow/pull/8310)
Fixed the error that occurs when loading models via `flow.load` on multiple nodes (https://github.com/Oneflow-Inc/oneflow/pull/8314)
Fixed instability of eager caused by the introduction of callback thread (https://github.com/Oneflow-Inc/oneflow/pull/8193)
Fixed the `tensor.from_numpy` interface to avoid memory leaks when the numpy input is non-contiguous (https://github.com/Oneflow-Inc/oneflow/pull/8391)
Fixed stack overflow when destructing the deep backward computational graph after recursion (https://github.com/Oneflow-Inc/oneflow/pull/8056)
Fixed global SBP inference of `unfold` (https://github.com/Oneflow-Inc/oneflow/pull/7883)
Fixed global SBP inference of `grid_sample` (https://github.com/Oneflow-Inc/oneflow/pull/7881)
Fixed incorrect pass of values in slice boxing kernel in certain cases (https://github.com/Oneflow-Inc/oneflow/pull/7893)
Fixed eager global inplace (https://github.com/Oneflow-Inc/oneflow/pull/7903)
Fixed SBP inference of the `upsample` op (https://github.com/Oneflow-Inc/oneflow/pull/7884)
Fixed SBP inference of `ScatterAdd`, `ScatterUpdate`, and `ScatterScalarUpdate` (https://github.com/Oneflow-Inc/oneflow/pull/7807)
Fixed backward memory error of `partial_fc` with Global Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8041)
Added support for S0 in `randperm` and fixed random ops producing equal local tensors across all ranks under Split (https://github.com/Oneflow-Inc/oneflow/pull/7571)
Fixed tensor getitem index error in global (https://github.com/Oneflow-Inc/oneflow/pull/8153)
Fixed SBP inference of `RoiAlign` and added a global unit test (https://github.com/Oneflow-Inc/oneflow/pull/7794)
Fixed SBP inference of the `stack` op (https://github.com/Oneflow-Inc/oneflow/pull/8181)
Fixed random initialization in median under CPU global (https://github.com/Oneflow-Inc/oneflow/pull/8245)
Fixed SBP inference of the `narrow` op and added global unit tests for `narrow` and `chunk` (https://github.com/Oneflow-Inc/oneflow/pull/7750)
Improved the legal SBP list of `batch_matmul` (https://github.com/Oneflow-Inc/oneflow/pull/8385)
Fixed NLLLoss’ failure to support model parallelism (https://github.com/Oneflow-Inc/oneflow/pull/8380)
Fixed S->S and S->P inference in Slice Op SBP infer (https://github.com/Oneflow-Inc/oneflow/pull/8521)
Fixed the bug that occurs when a Tensor dim is set to -1
Fixed the failure to directly convert a Tensor to int or float in Python (https://github.com/Oneflow-Inc/oneflow/pull/7927)
Fixed the bug in `Tensor.is_contiguous` that skipped initialization when caching and executed random initialization when getting values (https://github.com/Oneflow-Inc/oneflow/pull/7785)
Fixed the bug in Tensor slice view under 1d contiguous (https://github.com/Oneflow-Inc/oneflow/pull/7898)
Fixed incorrect processing of the None value by `Tensor.__eq__` (https://github.com/Oneflow-Inc/oneflow/pull/7938)
Fixed unaligned memory size in the `from_numpy` interface (https://github.com/Oneflow-Inc/oneflow/pull/7963)
Fixed incorrect initialization of the random seed in Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7904)
Fixed the failure of `oneflow.Size` to create a Tensor with a specified shape (https://github.com/Oneflow-Inc/oneflow/pull/8429)
Aligned the `alpha` parameter in `Tensor.add` (https://github.com/Oneflow-Inc/oneflow/pull/8140)
Fixed failure of `add` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7827)
Fixed failure of `reduce_sum` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7866)
Fixed failure of `one_hot` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7975)
Fixed failure of `gather` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8376)
Fixed a "memory access out of bounds" error in the `dim_scatter` kernel under Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8418)
Fixed failure of the start and end parameters of the `arange` op to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8522)
Fixed failure of `all` to support Scalar Tensor and 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8547)
Fixed failure of `conv` and `deconv` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8001)
Fixed failure of `cuda_check_numerics` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8050)
Fixed failure of `expand` and `advanced_index` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8094)
Fixed the bug that occurs when processing 0-Size Tensor in the `repeat_interleave` kernel and removed the relevant special judge in `gather` (https://github.com/Oneflow-Inc/oneflow/pull/8414)
Fixed failure of `diag` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8557)
Fixed sorting in the `nms` unit test (https://github.com/Oneflow-Inc/oneflow/pull/7831)
Fixed torch alignment of the beta and threshold interfaces of the `softplus` op (https://github.com/Oneflow-Inc/oneflow/pull/7888)
Fixed failure of `expand` to support passing tuples as parameters (https://github.com/Oneflow-Inc/oneflow/pull/7913)
Fixed computation failure in `randperm` when n is too large (https://github.com/Oneflow-Inc/oneflow/pull/7908)
Fixed failure of `meshgrid` to accept a list or tuple in parameter passing (https://github.com/Oneflow-Inc/oneflow/pull/7933)
Fixed the `nn.functional.conv2d` bug that all parameters must be specified (https://github.com/Oneflow-Inc/oneflow/pull/7892)
Fixed failure of `rand` and `randn` to support tuple as an input (https://github.com/Oneflow-Inc/oneflow/pull/7914)
Fixed the bug that occurs in `concat` when inputs are of inconsistent data types (https://github.com/Oneflow-Inc/oneflow/pull/7921)
Fixed the wrong device id got by the generator in certain cases in `randn`, `dropout`, `randint`, `rand`, `random_mask_like`, and `randperm` (https://github.com/Oneflow-Inc/oneflow/pull/7896)
Fixed inconsistent behaviors of `__shfl_sync` under `sm_61` in `layernorm` (https://github.com/Oneflow-Inc/oneflow/pull/7978)
Fixed failure of the `scatter` op to support negative dim (https://github.com/Oneflow-Inc/oneflow/pull/7934)
Fixed the bug in the `scatter` op nd update value (https://github.com/Oneflow-Inc/oneflow/pull/7953)
Fixed failure of `masked_select` to support certain Broadcast operations in eager mode (https://github.com/Oneflow-Inc/oneflow/pull/7984)
Fixed the bug in the `PReLU` op when dispatching num_blocks (https://github.com/Oneflow-Inc/oneflow/pull/8004)
Fixed misused numpy forced synchronization logic in `index_select` python and moved the logic into the functor implementation (https://github.com/Oneflow-Inc/oneflow/pull/7965)
Aligned the dtype parameter in `prod` (https://github.com/Oneflow-Inc/oneflow/pull/7932)
Fixed the bug that occurs when `ord = 0` in the `linalg.vector_norm` op; fixed the check on nan/inf by clip_grad (https://github.com/Oneflow-Inc/oneflow/pull/8007)
Fixed failure of `min` and `max` to operate on inconsistent dtypes (https://github.com/Oneflow-Inc/oneflow/pull/8021)
Added a `num_batches_tracked` buffer to `batch_norm` to facilitate transferring ResNet-18, a torch pretrained model, to OneFlow (https://github.com/Oneflow-Inc/oneflow/pull/7920)
Fixed the misuse of logf
, expf
, and powf
in math kernel (https://github.com/Oneflow-Inc/oneflow/pull/8038)
Fixed exclusion of dtype parameters in cumsum
and cumprod
and provided Tensor.cumsum
and Tensor.cumprod
methods (https://github.com/Oneflow-Inc/oneflow/pull/8065)
Fixed possible overflow when dtype is not int64 in non_zero op (https://github.com/Oneflow-Inc/oneflow/pull/7907)
Aligned sum, mean, all, any, and prod operations in reduce (https://github.com/Oneflow-Inc/oneflow/pull/8085)
Fixed incorrect backward computation in cumprod (https://github.com/Oneflow-Inc/oneflow/pull/8136)
Aligned the alpha parameter in the sub operation (https://github.com/Oneflow-Inc/oneflow/pull/8026)
Fixed shape inference in upsample op (https://github.com/Oneflow-Inc/oneflow/pull/8105)
Fixed failure of addn inplace operation on CPU tensors (https://github.com/Oneflow-Inc/oneflow/pull/8280)
Fixed the limit on tensor size in cum backward op based on the size of shared memory (https://github.com/Oneflow-Inc/oneflow/pull/8289)
Improved the logic of dtype inference for arange op (https://github.com/Oneflow-Inc/oneflow/pull/8338)
Fixed NaN propagation of UnaryFunctor (https://github.com/Oneflow-Inc/oneflow/pull/8346)
Fixed ndim check of pad (https://github.com/Oneflow-Inc/oneflow/pull/8354)
Fixed vector check in broadcast_min and broadcast_max backward computations (https://github.com/Oneflow-Inc/oneflow/pull/8379)
Fixed the bug in the index computation logic in cumprod op (https://github.com/Oneflow-Inc/oneflow/pull/8388)
Fixed possible int32 overflow in softmax and math unary/binary CUDA kernels; for kernels that perform integer division on i in CUDA_1D_KERNEL_LOOP, added an if statement to branch the computation, preventing performance loss in the common cases where int32 suffices (https://github.com/Oneflow-Inc/oneflow/pull/8472)
Fixed failure to pass size via size=(...) in random ops (normal, rand, randn, randint, and randperm) (https://github.com/Oneflow-Inc/oneflow/pull/8506)
Fixed error in cudaGetDeviceCount when the CUDA device count is 0 (https://github.com/Oneflow-Inc/oneflow/pull/8184)
Fixed possible unregistration of devices caused by the hob.ToString method; used static local variables to establish the dependency between the static variables of device registration and the static device-registration code (https://github.com/Oneflow-Inc/oneflow/pull/8235)
Fixed cudaErrorNoDevice caused by driver errors (https://github.com/Oneflow-Inc/oneflow/pull/8262)
Fixed memory leak caused by realpath (https://github.com/Oneflow-Inc/oneflow/pull/8540)
Introduced AutogradCapturedTensor in backward computation to avoid circular reference and allow correct backtracking to the input gradient node in higher order derivative graph (https://github.com/Oneflow-Inc/oneflow/pull/7808)
Added higher-order derivatives for the sin/cos ops; fixed autograd bugs related to higher-order derivatives (https://github.com/Oneflow-Inc/oneflow/pull/8163)
Fixed bugs in backward computation of concat and split_like to support higher-order derivatives (https://github.com/Oneflow-Inc/oneflow/pull/8208)
Fixed RTD [sphinx] failure to build docstr (https://github.com/Oneflow-Inc/oneflow/pull/7901)
Fixed compilation failure caused by opencv copy header failure (https://github.com/Oneflow-Inc/oneflow/pull/7944)
Fixed failure to generate a new .so in compilation when CMAKE_LINK_DEPENDS_NO_SHARED=YES (https://github.com/Oneflow-Inc/oneflow/pull/7868)
Fixed Eigen url in cmake third party (https://github.com/Oneflow-Inc/oneflow/pull/8223)
Fixed the bug caused by multi-time linking to libof_protoobj in XRT (https://github.com/Oneflow-Inc/oneflow/pull/8326)
Made libproto a dynamic library to avoid collision between static global variables (https://github.com/Oneflow-Inc/oneflow/pull/8345)
Made of_pyext_obj static only when there is one Python extension dynamic library that has Python symbols (https://github.com/Oneflow-Inc/oneflow/pull/8393)
Fixed the undefined symbol: del_curterm error in source code compilation (https://github.com/Oneflow-Inc/oneflow/issues/8398)
Fixed false positive warning in gcc11 compilation (https://github.com/Oneflow-Inc/oneflow/pull/8401)
Fixed SegFault that occurs when unzipping dataset in the container by making zlib a dynamic library (https://github.com/Oneflow-Inc/oneflow/pull/8481)
Fixed undefined reference of culibosTlsSetValue (https://github.com/Oneflow-Inc/oneflow/pull/8479)
Fixed stringop-truncation compilation error for gcc9 (https://github.com/Oneflow-Inc/oneflow/pull/8532)
Disabled static link of Simple CI and enabled debug build to avoid too many symbols (https://github.com/Oneflow-Inc/oneflow/pull/7940)
Fixed the bug in AutoTest fake program; Fixed print error in AutoTest (https://github.com/Oneflow-Inc/oneflow/pull/8279; https://github.com/Oneflow-Inc/oneflow/pull/8290)
Disabled conv3d test temporarily for its relatively large error of random values (https://github.com/Oneflow-Inc/oneflow/pull/7969)
Reduced test error in nn.LayerNorm (https://github.com/Oneflow-Inc/oneflow/pull/7941)
Optimized input data range of certain math op tests (https://github.com/Oneflow-Inc/oneflow/pull/8010)
Fixed incorrect unit test case in permute (https://github.com/Oneflow-Inc/oneflow/pull/8083)
Aligned error message of chunk to torch (https://github.com/Oneflow-Inc/oneflow/pull/8096)
Fixed incorrect use of permute in tensor tests (https://github.com/Oneflow-Inc/oneflow/pull/8144)
Fixed omission of test cases in instancenorm (https://github.com/Oneflow-Inc/oneflow/pull/8215)
Adjusted the unit test threshold for leaky_relu (https://github.com/Oneflow-Inc/oneflow/pull/8242)
Annotated cpu bn grad method that tests with random values (https://github.com/Oneflow-Inc/oneflow/pull/8257)
Skipped test cases of global argmax and median in multi-GPU scenarios (https://github.com/Oneflow-Inc/oneflow/pull/8264)
Adjusted the unit test threshold for fused_dot_feature_interaction (https://github.com/Oneflow-Inc/oneflow/pull/8293)
Disabled unit tests for conv_transpose1d, conv_transpose2d, and conv_transpose3d (https://github.com/Oneflow-Inc/oneflow/pull/8319)
Adjusted the tolerance setting in the embedding_renorm unit test (https://github.com/Oneflow-Inc/oneflow/pull/8394)
Removed test cases with excessive accumulated elements in test_fused_dot_feature_interaction_pooling_sum to avoid overly large sum error (https://github.com/Oneflow-Inc/oneflow/pull/8425)
Ensured that all PyTorch references in OneFlow API documentation belong to the same PyTorch version (1.10.0) (https://github.com/Oneflow-Inc/oneflow/pull/8058)
Added "copy" button for code in API docs to facilitate trial runs of sample code (https://github.com/Oneflow-Inc/oneflow/pull/7997)
Refined script that automatically generates version status for OneFlow APIs and fixed bugs in docs (https://github.com/Oneflow-Inc/oneflow/pull/8546)
Refined interface documentation of Tensor and Module (https://github.com/Oneflow-Inc/oneflow/pull/7823)
Refined Tensor.to_global interface documentation and added descriptions of grad_sbp
Refined Tensor.to_local interface documentation
Added Tensor Attributes docs for oneflow.placement, oneflow.env.all_device_placement, and oneflow.sbp.sbp
Added interface documentation for Module.to_consistent (outdated) and Module.to_global
Fixed invalid links in Tensor docs and updated consistent to global (https://github.com/Oneflow-Inc/oneflow/pull/7821)
Added docstr for Tensor.sqrt, Tensor.square, Tensor.addmm, Tensor.cosh, Tensor.diagonal, Tensor.log, Tensor.ndim, and Tensor.rsqrt (https://github.com/Oneflow-Inc/oneflow/pull/7841)
Enabled derived classes of pybind11 to add documentation for non-overriding methods and added interface documentation related to Tensor and autograd (https://github.com/Oneflow-Inc/oneflow/pull/7849)
Refined documentation of oneflow.argsort (https://github.com/Oneflow-Inc/oneflow/pull/7844)
Refined documentation of Tensor.zero_, Tensor.is_contiguous, Tensor.is_cuda, and the oneflow.nn.functional.layer_norm op (https://github.com/Oneflow-Inc/oneflow/pull/7839)
Refined interface documentation of support_sparse and step in oneflow.optim.Adamw and oneflow.optim.SGD (https://github.com/Oneflow-Inc/oneflow/pull/7848)
Refined interface documentation of LambdaLR.step, ReduceLROnPlateau.in_cooldown, and ReduceLROnPlateau.is_better (https://github.com/Oneflow-Inc/oneflow/pull/7848)
Refined interface documentation of nn.Module (https://github.com/Oneflow-Inc/oneflow/pull/8190)
Refined interface documentation of oneflow.optim.lr_scheduler.PolynomialLR (https://github.com/Oneflow-Inc/oneflow/pull/8430)
Refined docs and formula illustrations for oneflow.nn.CombinedMarginLoss (https://github.com/Oneflow-Inc/oneflow/pull/8206)
Refined documentation of oneflow.logical_and, oneflow.logical_or, oneflow.logical_xor, and oneflow.logical_not (https://github.com/Oneflow-Inc/oneflow/pull/8297)
Fixed the bug in the documentation of quantization ops (https://github.com/Oneflow-Inc/oneflow/pull/8333)
Updated the solution in Troubleshooting for the case when libunwind.h is not found (https://github.com/Oneflow-Inc/oneflow/pull/8336)
Restructured API documentation based on features; added and refined docs of features that are unique to OneFlow (https://github.com/Oneflow-Inc/oneflow/pull/8392)
Published by jackalcooper over 2 years ago
OneFlow v0.7.0 came out. Welcome to use it. We would love to hear your feedback!
https://mp.weixin.qq.com/s/dSR-2Xw92eoFhF0c6MtutQ
This release has the following highlights:
Provides a Tensor that can be executed in multi-nodes multi-GPUs scenarios: Global Tensor. It is an easy-to-use solution for distributed execution. It makes it easier to implement various distributed parallel strategies and enables more flexible and user-friendly distributed implementation. It supports models including ResNet50, Wide and Deep, GPT, Bert, Swin-Transformer, InsightFace, etc.
Continues to improve nn.Graph. Supports the advanced features such as ZeRO, GradAcc, Checkpointing, and Pipelining, and enriches the graph.debug mode. Supports random 2D SBP conversion, semi-automatic derivation of 2D SBP, resuming training from the last checkpoint, etc. Adds OneFlow Feature Stages Identifications and identifies each feature of nn.Graph. For nn.Graph, its basic features are at the Beta Stage, which can meet most of the requirements of users; Advanced features are at Alpha Stage, meeting standard requirements.
Deeply optimizes the performance of Eager mode. The performance of the Swin-Transformer model is 3 times higher than that of v0.6.0 when tested on the V100.
Operators-related improvements: In the single-node single-GPU scenario, OneFlow's compatibility with PyTorch is further improved. The interfaces, semantics, and produced results of operators supported by OneFlow are consistent with those of operators supported by PyTorch, and an automatic testing framework is designed to verify the consistency. With common models, you can accomplish the migration by running import oneflow as torch. Compared with v0.6.0, OneFlow adds 16 operators, optimizes the performance of 6 operators, and fixes bugs in 16 operators.
Supports Einsum and View mechanism.
Compiler-related improvements: OneFlow is officially connected to the MLIR ecosystem.
Releases OneFlow-Serving v0.1.0: We provide an out-of-the-box Triton OneFlow backend docker image. try here.
Releases LiBai v0.1.0, a toolbox for massively distributed parallel training of Transformer. Compared with customized code bases such as Megatron-LM, LiBai provides a series of models and training components for distributed training based on a modular design, aiming to make models trained in distributed mode as convenient as in single-GPU mode.
Releases Flow-Vision v0.1.0: adds DeiT, ConvNeXt, ReXNet, and other models and updates tutorials and documentation.
OneFlow Feature Stages identifies the maturity level of OneFlow features. It provides users with a status description of a feature to inform the specific level of it, such as completeness, API stability, documentation, etc. It Provides OneFlow developers with a standard for feature refinement, which facilitates further improvement.
OneFlow Feature Stages
Stable Stage
Release Candidate (RC) Stage
Beta Stage
Alpha Stage
Pre-alpha Stage
Global Tensor is a newly released set of distributed computing interfaces. It can easily support any parallelism including data parallelism, model parallelism, and pipeline parallelism. Unlike a normal Tensor (hereafter called Local Tensor), Global Tensor is a Tensor with a global view, whose data is distributed in a specific way across a set of devices in a cluster, and each node stores some or all of the Global Tensor's data. Placement and SBP are the basic properties of the Global Tensor that describe the distribution of the data in clusters.
Global Tensor supports three different ways of data distribution, which we collectively refer to as SBP.
With split(dim), the data is split along the dim dimension and distributed to each device; with broadcast, every device holds a full copy; with partial_sum, the logical tensor is the elementwise sum of the per-device parts.
Global Tensor has basically the same computational interfaces as Local Tensor. Only with small changes, you can convert the single-GPU mode to the distributed mode.
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0])
>>> y = x * x
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0],
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(0))
>>> y = x * x
# This multiplication is performed on both rank 0 and rank 1
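The three SBP distribution types can be sketched in plain Python (this is a conceptual illustration only, not the OneFlow API; the variable names are hypothetical):

```python
# Logical (global) tensor distributed over two ranks:
data = [1.0, 2.0, 3.0, 4.0]

# split(0): each rank holds a contiguous slice along dim 0
split_parts = {0: data[:2], 1: data[2:]}

# broadcast: every rank holds the full tensor
broadcast_parts = {0: list(data), 1: list(data)}

# partial_sum: the logical tensor is the elementwise sum of per-rank parts
partial_parts = {0: [0.5, 1.0, 1.5, 2.0], 1: [0.5, 1.0, 1.5, 2.0]}
recovered = [a + b for a, b in zip(partial_parts[0], partial_parts[1])]
```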
With Tensor.to_global interface, you can create a Global Tensor based on a Local Tensor, and regard this tensor as the local tensor of the Global Tensor on the present device.
With Tensor.to_local interface, you can return the local tensor of the Global Tensor on the present device.
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0],
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(0))
>>> y = x.to_local()
>>> y.size()
oneflow.Size([1])
>>> y
tensor([1.], device='cuda:0', dtype=oneflow.float32)
# tensor([2.], device='cuda:0', dtype=oneflow.float32) if rank is 1
With the Tensor.to_global interface, you can redistribute the data of a Global Tensor across clusters. The data can be distributed to another set of nodes, and the way of distribution within that set of nodes can also be changed (i.e., the SBP can be changed). Redistribution usually generates inter-process data communication, but the Tensor.to_global interface hides the complicated low-level communication details.
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0], placement=flow.placement("cuda", ranks=[0, 1]), sbp=flow.sbp.split(0))
>>> y = x.to_global(placement=flow.placement("cuda", ranks=[2, 3]), sbp=flow.sbp.broadcast)
Each operator of OneFlow defines a set of SBP signatures for the input and output tensor. Global Tensor supports automatic redistribution to provide the required SBP signature of a certain interface. Just as the code shown below:
>>> import oneflow as flow
>>> x = flow.randn(4, 4,
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(0))
>>> y = flow.randn(4, 4,
placement=flow.placement("cuda", ranks=[0, 1]),
sbp=flow.sbp.split(1))
>>> z = x + y
When x + y is executed, since x is split along dimension 0 while y is split along dimension 1, their local tensors at each device cannot be added up directly. Therefore, x's SBP will be automatically converted to flow.sbp.split(1), or y's SBP will be converted to flow.sbp.split(0); the SBP of the result z is flow.sbp.split(1) or flow.sbp.split(0) accordingly.
Global Tensor doesn't support mix-in with DDP interface currently.
Global Tensor requires all devices to execute simultaneously, and the code that has branches would lead to process deadlock because of divergent execution paths. We will continue fixing this problem.
Fundamental features enter into Beta Stage, meeting most requirements of users;
Advanced features enter into Alpha Stage, meeting standard requirements of users;
ResNet50, Wide and Deep, GPT, Bert, Swin-Transformer, InsightFace, and other models are supported;
Static and dynamic casting of operators under Static Graph enter into Beta Stage from Alpha Stage
Adds the unit test of static execution for all legal operators under nn.Graph, and automated unit test is ready;
Supports more flexible inputs and outputs, including List/Tuple/Dict and their nesting, and fixes the Tuple problem of producing a return size of "1";
Adds backward automatic test;
Optimizer and LR Scheduler under Static Graph enter into Beta Stage from Alpha Stage.
Adds more built-in LR schedulers, including WarmupLR, CosineAnnealingWarmRestarts and other common schedulers, and provides SequentialLR and ChainedScheduler to enable scheduler with different combination capacity;
Refactors the scheduler's get_lr function into a pure function. This change permits using schedulers in combination by changing the LR calculation from an iterative solution to an analytical solution;
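To illustrate what "analytical solution" means here, the following is a minimal pure-Python sketch (hypothetical names, not OneFlow's actual implementation) of a cosine-annealing LR computed as a pure function of the current step, which is what makes schedulers composable:

```python
import math

def cosine_annealing_lr(base_lr, step, t_max, eta_min=0.0):
    """Analytical LR: depends only on the current step, not on the
    previously computed LR, so schedulers can be chained freely."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2
```

Because the value at any step is computed directly, a SequentialLR-style wrapper can query any constituent scheduler at an arbitrary step without replaying the whole history.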
Adds the "is_sparse" parameter for the add_optimizer interface, supporting sparse updates under graph mode. Optimizers that support sparse updates include Adam and SGD, while optimizers under Eager mode don't support sparse updates yet. A subsequent version will support both sparse updates and sparse tensors. The feature is at Pre-alpha Stage;
Adds a debug print feature for LR and Step; you only need to turn on the LR Scheduler's verbose option.
state_dict and load_state_dict under Static Graph are newly added, which allow resuming training from the last checkpoint. The feature is at Beta Stage;
Debug under Static Graph enters into Beta Stage from Alpha Stage;
Adds debug(2) and debug(3), which allow finding problems in nn.Module by locating the Python code of operators at the C++ layer and by locating forward graph creation and inference for operators;
Adds the display of memory overhead
ZeRO-DP under Static Graph is newly added, which allows reducing the memory overhead related to the Optimizer under data parallelism. The feature is at Alpha Stage;
Global Tensor under Static Graph supports multiple parallel methods, and the feature is between Alpha Stage and Beta Stage;
It is utilized in LiBai and other model libraries;
It is widely utilized in OneFlow's model libraries, and the coverage of unit test is still ongoing;
For 1D Global Tensor, you can define only the input tensor's SBP, while the output tensor's SBP can be derived automatically with good results. The feature is at Beta Stage;
For 2D Global Tensor, you can define only the input tensor's SBP, while the output tensor's SBP can be derived automatically with good results. The feature is at Alpha Stage;
Conversion from 1D to ND or ND to 1D is newly supported, and the feature is at Alpha Stage;
Random conversion of 2D SBP is newly supported, and the feature is at Alpha Stage;
Testing of 1D&2D single operator is still ongoing, and the feature is at Pre-alpha Stage;
Selecting SBP with semi-automatic derivation is supported, and the feature is at Pre-alpha Stage;
For Gradient Accumulation under Static Graph, we refactored and repaired support for Reshape and added API documentation. In place of the mini-batch input interface, a future version will offer micro-batch input with a better experience. The feature moves from Pre-alpha to Alpha Stage;
For pipeline parallelism under Static Graph, the tutorial is perfected, and pipeline parallelism is available in LiBai and other model libraries. The feature is at Beta Stage;
For automatic mixed precision (AMP) under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage;
For Activation Checkpointing under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage;
For Op Fuse optimization under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage;
For XLA/TensorRT/OpenVINO execution under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha to Alpha Stage;
Tutorials
API Documentation
Tutorials of pipeline parallelism:
The performance of Eager is deeply optimized. When running the Swin-Transformer model on V100 GPUs, OneFlow delivers a 25% speedup over PyTorch on a single GPU and a 10% speedup on 8 GPUs;
The communication scheduling policy for NCCL in DDP is optimized;
DDP supports the optimization of AllReduce fuse, reducing additional overhead generated by fragmented AllReduce, with a 5% performance speedup when it is tested on ResNet50;
VM supports the optimization of instruction fusion, significantly saving scheduling overhead of Kernel;
Additional memory overhead is optimized when CPU overload is too high;
Eager DataLoader supports the optimization of inter-process memory sharing;
The performance of Clip Grad is optimized;
The performance of CPU operators such as unary and binary element-wise is improved by 4 times, and the speed of Swin-Transformer's dataloader is improved by 2.5 times. https://github.com/Oneflow-Inc/oneflow/pull/7319
Adds the functionality of inter-process shared memory to Dataloader, which greatly improves the performance of DataLoader in DDP.
Adds Bool type Tensor. https://github.com/Oneflow-Inc/oneflow/pull/7523
Implements to_contiguous, which the view mechanism relies on. https://github.com/Oneflow-Inc/oneflow/pull/7670
Adds Scalar div operators. https://github.com/Oneflow-Inc/oneflow/pull/7483
Adds Lamb optimizer. https://github.com/Oneflow-Inc/oneflow/pull/7389
Adds Polynomial Learning Rate Scheduler. https://github.com/Oneflow-Inc/oneflow/pull/7260
Adds tensor_split and as_strided operators. https://github.com/Oneflow-Inc/oneflow/pull/7258 & https://github.com/Oneflow-Inc/oneflow/pull/7275
Adds cumprod operators. https://github.com/Oneflow-Inc/oneflow/pull/7278
Adds Tensor.T() and oneflow.t() operators. https://github.com/Oneflow-Inc/oneflow/pull/7269
Adds normalize operators. https://github.com/Oneflow-Inc/oneflow/pull/7113
Adds the inplace version of div and sub operators. https://github.com/Oneflow-Inc/oneflow/pull/7293
Adds the feature of Module.zero_grad. https://github.com/Oneflow-Inc/oneflow/pull/7587/
Adds the feature of Scalar Tensor being the index to do list indexing. https://github.com/Oneflow-Inc/oneflow/pull/7597
Adds support for Leaky ReLU operators half type. https://github.com/Oneflow-Inc/oneflow/pull/7569
Adds support for mask select operators. https://github.com/Oneflow-Inc/oneflow/pull/7492
Adds non-reduce communication operations such as Bool type Broadcast and Allgather. https://github.com/Oneflow-Inc/oneflow/pull/7366
Develops autotest that supports eager global based on an autotest framework. https://github.com/Oneflow-Inc/oneflow/pull/7204
Optimizes performance for ReduceSum CUDA Kernel. https://github.com/Oneflow-Inc/oneflow/pull/7684
Optimizes CUDA Kernel of gather operators. https://github.com/Oneflow-Inc/oneflow/pull/7351
Optimizes the performance for CUDA Kernel of MaxPool and AvgPool operators in NCHW. https://github.com/Oneflow-Inc/oneflow/pull/7426 & https://github.com/Oneflow-Inc/oneflow/pull/7451
Optimizes the backward computing of PReLU operators, which can save more memory in general. https://github.com/Oneflow-Inc/oneflow/pull/7600
Optimizes backward Kernel of LayerNorm to further save memory. https://github.com/Oneflow-Inc/oneflow/pull/6996
Supports passing a single int for stride and dilation in Conv1D/2D/3D and DeConv1D/2D/3D kernels. Adds the Tensor.zero_() interface and aligns tensor.norm, torch.max, and torch.min with PyTorch.
Supports inplace in flow.nn.functional.dropout. https://github.com/Oneflow-Inc/oneflow/pull/7593
Fixes bug where the BatchNorm module raises an error when affine=False. https://github.com/Oneflow-Inc/oneflow/pull/7755
Fixes Maximum and Minimum backward bug. https://github.com/Oneflow-Inc/oneflow/pull/7519
Fixes bug where the result of var operators is unexpected in some cases. https://github.com/Oneflow-Inc/oneflow/pull/7517
Fixes incorrect behavior of Tensor deepcopy bug. https://github.com/Oneflow-Inc/oneflow/pull/7490
Fixes bug where input index is scalar tensor in slice operators. https://github.com/Oneflow-Inc/oneflow/pull/7479
Fixes bug where BinaryCrossEntropy can produce nan in half. https://github.com/Oneflow-Inc/oneflow/pull/7476
Fixes bug where an error is raised when the base and exponent of pow operators are respectively real number type and Tensor type. https://github.com/Oneflow-Inc/oneflow/pull/7729
Fixes stack operators backward bug. https://github.com/Oneflow-Inc/oneflow/pull/7363
Fixes inefficiency problem caused by CPU synchronization when clip grad is executed on CUDA with the default configuration. https://github.com/Oneflow-Inc/oneflow/pull/7304
Fixes the SBP inference of Batch Gather and Unsorted Batch Segment Sum operators, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7590
Fixes Physical Shape inference of Affine Grid operators, fixes the unexpected result bug in some SBP cases, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7578
Fixes the problem that arange operators don't support generating 0 size tensor, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7576
Fixes the incorrect SBP inference of flip operators, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7496
Fixes advanced indexing and zeroslike operators SBP bugs. https://github.com/Oneflow-Inc/oneflow/pull/7238
Fixes bug where Eager global inplace might not be successful. https://github.com/Oneflow-Inc/oneflow/pull/7348
Adds einsum operators. einsum provides a set of concise but elegant rules, which can implement tensor operations including but not limited to: inner product, outer product, tensor multiplication, tensor transposition, tensor contraction, etc. Proficient use of einsum allows you to easily implement various complex tensor operations and be less error-prone. https://github.com/Oneflow-Inc/oneflow/pull/7526
Adds the view mechanism. The view mechanism allows common operators to reuse/share a Tensor's memory, saving memory by reducing the Kernel Launch/Compute process. At present, new view operators that do not change the tensor.is_contiguous() property have been added, such as reshape, view, squeeze, unsqueeze, etc.: https://github.com/Oneflow-Inc/oneflow/pull/7503 More view operators will be added later (such as transpose, permute, narrow, expand, and unfold).
OneFlow is officially connected to the MLIR ecosystem, and the OneFlow Dialect component is complete. Successfully completes OneFlow Job (computation graph of OneFlow nn.Graph) and RoundTrip of MLIR, and runs RoundTrip tests on all operators of OneFlow in CI process.
Implements static graph optimization with a series of automatic fused operators based on MLIR DRR to accelerate OneFlow model training and inference.
OneFlow Serving v0.1.0 comes out with the following features:
Provides OneFlow C++ API used for inference, supporting model loading and static graph inference.
The model weights and the computation graph in MLIR format can be saved simultaneously by running flow.save(graph) in Python. They can be loaded via the C++ API (loading the computation graph is not supported in the Python API at present).
Supports inference of OneFlow model using TensorRT and OpenVINO automatically without model conversion (based on OneFlow XRT module), achieving better acceleration on NVIDIA GPU and Intel CPU.
Implements Triton OneFlow backend
Welcome to use the project deployed with Triton OneFlow backend launched on OneFlow Cloud Platform.
LiBai is a toolbox for massively distributed parallel training of Transformer. Compared with custom code bases such as Megatron-LM, LiBai provides a series of models and training components for distributed training based on a modular design, aiming to make models trained in distributed mode as convenient as in single-GPU mode. The 0.1.0 version mainly supports the following features and models:
Features:
Features: Trainer and Evaluator
Models: Bert (3D Parallelism), GPT-2 (3D Parallelism), ViT (3D Parallelism), Swin-Transformer (Data Parallelism), plus tasks under projects/
flowvision 0.1.0 stable version comes out with the following improvements based on the previous version:
Added trunc_normal_ method
Added DeiT model; rebuilt the VisionTransformer model
Added ConvNeXt model
Added ReXNet model
Added PolyLRScheduler and TanhLRScheduler
Added F.normalize in SSD model
Added EfficientNet and Res2Net
Added vit_small_patch32_384 and res2net50_48w_2s models
Refined model zoo and ran more complete tests on existing models
Refined load_state_dict_from_url method to automatically save the downloaded weights in the cache folder
Refined Getting Started and flowvision.models documentation
Published by jackalcooper almost 3 years ago
OneFlow has been open sourced for 528 days since July 31, 2020. Today OneFlow v0.6.0 came out. Welcome to use OneFlow v0.6.0. We would love to hear feedback!
This version mainly updates three parts: framework, model, and OneFlow-ONNX. Highlights include:
The following are the detailed release notes.
import oneflow as torch
Users can customize autograd.Function just like using Torch.
Serving functionality of models is provided by OneFlow as Nvidia Triton's backend.
ResNet, DenseNet, VGG, ResNext, EfficientNet, etc.
ViT, PVT, Swin-Transformer, etc.
Mlp-Mixer, Res-MLP, g-MLP, etc.
sketch, candy, mosaic, rain_princess, and undie
For data augmentation operations like CenterCrop and ColorJitter similar to torchvision, developers can run import flowvision as torchvision to execute in most scenarios.
Advanced data augmentation operations implemented in flowvision.data:
Non-Local, SELayer, CBAM, BAM, ECA, etc.
PatchEmb, Pooler, ConvBnAct, etc.
drop-path, drop-block, and stochastic depth to improve model generalization ability
activation and weight_init to improve components like activation function and initialization method
Run pip install oneflow-onnx to experience it.
Published by jackalcooper about 3 years ago
oneflow.compatible.single_client
OneFlow can serve as a drop-in replacement of import torch for existing PyTorch projects. You could test it by interchanging import oneflow as torch and import torch as flow.
Here is a minimum example showcasing how to incorporate a nn.Module in a nn.Graph and have it run in lazy mode.
class NeuralGraph(flow.nn.Graph):
def __init__(self, ...):
super().__init__()
self.model = model # model is a nn.Module instance
def build(self, x):
y_pred = self.model(x)
return y_pred
graph = NeuralGraph() # to create a nn.Graph instance
y_pred = graph(x) # to run the created nn.Graph
-DTREAT_WARNINGS_AS_ERRORS=OFF #6008
Renamed *parallel_distribution* to *nd_sbp* (1) #5815
unpack_call_dispatcher for better performance #5820
JUST_MSG and CHECK_JUST_MSG #5904
raise RuntimeError #5890
Renamed the ParallelDistribution class to NdSbp #5814
nn.AdaptiveAvgPool1d and nn.AdaptiveAvgPool3d #5445
Published by jackalcooper about 3 years ago
oneflow.compatible.single_client
import torch
for existing Pytorch projects. You could test it by inter-changing import oneflow as torch
and import torch as flow
.Here is a minimum example showcasing how to incorporate a nn.Module
in a nn.Graph
and have it run in lazy mode.
class NeuralGraph(flow.nn.Graph):
def __init__(self, ...):
super().__init__()
self.model = model # model is a nn.Module instance
def build(self, x):
y_pred = self.model(x)
return y_pred
graph = NeuralGraph() # to create a nn.Graph instance
y_pred = graph(x) # to run the created nn.Graph
-DTREAT_WARNINGS_AS_ERRORS=OFF
#6008
*parallel_distribution*
to *nd_sbp*
(1) #5815
unpack_call_dispatcher
for better performance #5820
JUST_MSG
and CHECK_JUST_MSG
#5904
raise RuntimeError
#5890
ParallelDistribution
class to NdSbp
#5814
nn.AdaptiveAvgPool1d
and nn.AdaptiveAvgPool3d
#5445
Published by jackalcooper over 3 years ago
In this release we added a large number of new features to OneFlow; 0.4.0 is the biggest update since OneFlow was open-sourced. It adds 2-D SBP, pipeline parallelism, a new Checkpoint interface, a large number of PyTorch-aligned interfaces, and CUDA 11.2 support. We have previously open-sourced the OneFlow GPT code, which makes heavy use of the new features in this release; you are also welcome to read the article "OneFlow — enabling every algorithm engineer to train GPT".
```python
# Cast to a 2-D SBP distribution on a (2, 2) device hierarchy
with flow.scope.placement("gpu", "0:0-3", (2, 2)):
    x = flow.hierarchical_parallel_cast(
        x, parallel_distribution=["B", "S(1)"]
    )

# Cast back to a 1-D distribution on a flat (4,) hierarchy
with flow.scope.placement("gpu", "0:0-3", (4,)):
    x = flow.hierarchical_parallel_cast(
        x, parallel_distribution=["S(0)"]
    )
```
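As an illustration of what the 2-D SBP signature `["B", "S(1)"]` above describes, here is a conceptual sketch in plain Python (not OneFlow code): on a (2, 2) device hierarchy, the tensor is broadcast along hierarchy axis 0 and split along its dim 1 across hierarchy axis 1.

```python
# Plain-Python sketch of a 2-D SBP layout; x plays the role of a 4x4 tensor.
x = [[r * 4 + c for c in range(4)] for r in range(4)]

def split_dim1(t, parts, idx):
    """Return the idx-th slice of t split evenly along dim 1."""
    w = len(t[0]) // parts
    return [row[idx * w:(idx + 1) * w] for row in t]

devices = {}
for i in range(2):        # hierarchy axis 0: "B" (broadcast, full copy)
    for j in range(2):    # hierarchy axis 1: "S(1)" (split tensor dim 1)
        devices[(i, j)] = split_dim1(x, 2, j)
```

Devices in the same column of the hierarchy hold identical data ("B"), while devices in the same row hold different halves of dim 1 ("S(1)").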
Scope for setting the `pipeline_stage` of a layer:

```python
with flow.experimental.scope.config(
    pipeline_stage_id_hint=dist_util.get_layer_stage(layer_idx)
):
    ...
```
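A hypothetical implementation of a stage-assignment helper like `dist_util.get_layer_stage` used above (the even-block policy here is an assumption for illustration, not OneFlow's actual code):

```python
def get_layer_stage(layer_idx, num_layers=24, num_stages=4):
    # Hypothetical policy: map layer indices onto pipeline stages in
    # contiguous, even blocks, e.g. 24 layers on 4 stages puts layers
    # 0-5 on stage 0, 6-11 on stage 1, and so on.
    layers_per_stage = (num_layers + num_stages - 1) // num_stages
    return layer_idx // layers_per_stage
```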
Gradient accumulation is configured through `FunctionConfig`:

```python
func_cfg = flow.FunctionConfig()
...
func_cfg.train.num_gradient_accumulation_steps(args.num_accumulation_steps)

@flow.global_function(..., function_config=func_cfg)
```
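Conceptually, gradient accumulation runs several micro-batches, sums their gradients, and applies a single optimizer update, so the effective batch size grows without extra activation memory. A plain-Python sketch (not OneFlow internals; `grad_fn` and `apply_update` are hypothetical stand-ins):

```python
def accumulated_step(micro_batches, grad_fn, apply_update):
    # Run one micro-batch at a time, summing gradients instead of
    # updating, then perform exactly one parameter update at the end.
    acc = None
    for mb in micro_batches:
        g = grad_fn(mb)
        acc = g if acc is None else acc + g
    apply_update(acc / len(micro_batches))  # average, then update once
```

For example, with `grad_fn=lambda mb: 2 * mb` over micro-batches `[1.0, 2.0, 3.0]`, a single update with the averaged gradient `4.0` is applied.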
Optimizer placement optimization (ZeRO) is also configured through `FunctionConfig`:

```python
func_cfg = flow.FunctionConfig()
...
func_cfg.optimizer_placement_optimization_mode(mode)  # mode = "non_distributed" or "distributed_split"

@flow.global_function(..., function_config=func_cfg)
```

`mode = "distributed_split"` corresponds to stage 2 of the DeepSpeed ZeRO optimization.

Activation checkpointing is enabled via a scope config:

```python
with flow.experimental.scope.config(
    checkpointing=True
):
```

For details, see the article "Sublinear memory optimization: the implementation of activation checkpointing in OneFlow".
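The idea behind `checkpointing=True`, sketched in plain Python (a conceptual sketch, not OneFlow's implementation): the forward pass of a checkpointed segment stores only the segment's input, and intermediate activations are recomputed during backward.

```python
def checkpointed_forward(layers, x):
    # Store only the segment input instead of every intermediate
    # activation, trading recomputation time for memory.
    segment_input = x
    for f in layers:
        x = f(x)
    return x, segment_input

def recompute_activations(layers, segment_input):
    # During backward, rerun the forward pass to regenerate the
    # activations needed for gradient computation.
    acts, x = [], segment_input
    for f in layers:
        x = f(x)
        acts.append(x)
    return acts
```

For a two-layer segment `[lambda v: v + 1, lambda v: v * 2]` applied to `3`, only the input `3` is saved, and the activations `[4, 8]` are regenerated on demand.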
Added the `oneflow.experimental` namespace, partially aligned with the `torch.xxx` interfaces. Usage of the new interfaces:

```python
import oneflow.experimental as flow
flow.enable_eager_execution()  # enable eager mode
```
Features currently aligned (in part):
flow.nn.Conv2d <-> torch.nn.Conv2d
flow.nn.BatchNorm2d <-> torch.nn.BatchNorm2d
flow.nn.ReLU <-> torch.nn.ReLU
flow.nn.MaxPool2d <-> torch.nn.MaxPool2d
flow.nn.AvgPool2d <-> torch.nn.AvgPool2d
flow.nn.Linear <-> torch.nn.Linear
flow.nn.CrossEntropyLoss <-> torch.nn.CrossEntropyLoss
flow.nn.Sequential <-> torch.nn.Sequential
flow.nn.Module.to <-> torch.nn.Module.to
flow.nn.Module.state_dict <-> torch.nn.Module.state_dict
flow.nn.Module.load_state_dict <-> torch.nn.Module.load_state_dict
flow.save <-> torch.save
flow.load <-> torch.load
flow.Tensor <-> torch.Tensor
flow.tensor <-> torch.tensor
flow.tensor.to <-> torch.tensor.to
flow.tensor.numpy <-> torch.tensor.numpy
flow.tensor arithmetic (+, -, *, /) <-> torch.tensor arithmetic (+, -, *, /)
flow.tensor.flatten <-> torch.tensor.flatten
flow.tensor.softmax <-> torch.tensor.softmax
flow.optim.SGD <-> torch.optim.SGD
With the modules above you can already easily build common networks such as ResNet, BERT, and MobileNetV3. Later releases will align/support more interfaces, at which point most networks built on PyTorch can be easily switched to OneFlow.
Quick-start LeNet example: https://github.com/Oneflow-Inc/models/blob/main/quick_start_demo_lenet/lenet.py
New interface documentation: https://oneflow.readthedocs.io/en/master/experimental.html
ResNet50 example aligned with torchvision: https://github.com/Oneflow-Inc/models/tree/main/resnet50
The next few releases will add more PyTorch-aligned interfaces. Interfaces aligned under `experimental` will be moved into the `oneflow` namespace in the 0.6.0 release, at which point they will be fully aligned with PyTorch, and OneFlow 0.6.0 will make eager the default execution mode. Eager mode currently supports single-GPU execution only; multi-GPU support will arrive in 0.5.0.
Previously, OneFlow releases used a "different package name, same version name" scheme, e.g. oneflow_cu102==0.3.4; starting from 0.4.0 the scheme is "same package name, different version name", e.g. oneflow==0.4.0+cu102. For the latest installation instructions, see the "Install with Pip Package" section of the README.
Both the stable and nightly versions of OneFlow now support the CUDA 11.2 platform (cu112).
The ONNX module is now maintained in the new repository https://github.com/Oneflow-Inc/oneflow_convert_tools, and the ONNX-related code in the main OneFlow repository will be removed in the next release; for details, see the article "How does the deep learning framework OneFlow interact with ONNX?". oneflow_convert_tools currently targets OneFlow's lazy mode, and its latest version is v0.3.2; versions targeting eager mode will be numbered starting from 0.4.0.
The next OneFlow release will include more comprehensive PyTorch compatibility, with richer interface support and multi-GPU support, and will also support conversion between dynamic and static graphs. Stay tuned!
Published by jackalcooper almost 4 years ago
- Move `swish` and `mish` from the `math` namespace to `nn` #4104
- Make `MaxWithLogThreshold` and `SafeLog` header only #4030
- `pack_size` in GenericLauncher #4014
Published by jackalcooper almost 4 years ago
Published by jackalcooper about 4 years ago
- `mean_square` and add unit tests for optimizers #3523
Published by jackalcooper about 4 years ago