Version 0.8.0
OneFlow v0.8.0 Release Note
OneFlow v0.8.0 is out. Welcome to install the new version for a better experience.
- Highlights
- Backwards Incompatible Change
- Deprecations
- New Features
- Performance
- Improvements
- Bug fixes
- Documentation
Highlights
This update contains 523 commits and the following highlights:
- PyTorch-compatible APIs have been further optimized: 68 new APIs aligned with PyTorch have been added, and 84 operator and interface compatibility bugs have been fixed. More PyTorch models can now be transferred to OneFlow with one click.
- All operators support Global Tensor more completely and efficiently: 28 Global Tensor-related bugs have been fixed, and 180 operator unit tests have been newly added.
- Graph's advanced features have been further optimized:
  - In addition to the existing ZeRO-DP, the Zero Redundancy Optimizer (ZeRO) can now be combined with model parallelism, 2D parallelism, and 3D parallelism, saving even more memory.
  - Graph provides a new pipeline parallelism API, which not only simplifies pipeline parallelism configuration but also improves the performance of pipeline parallelism and 3D parallelism.
  - Multi-dimensional debugging functionality for the logical graph, the light plan physical graph, memory analysis, Python stack information, and more has been added, making Graph.debug more efficient.
- Empowered by OneFlow v0.8.0 and LiBai v0.2.0, 3D-parallel training speed for GPT and BERT has increased notably, and it exceeds Megatron-LM with the same configuration in multiple dimensions. For more details, please click here.
- OneEmbedding has been released. It is an extension component designed for large-scale recommendation systems, boasting high efficiency, extensibility, and flexibility, among other advantages.
- Multi-device adaptation: OneFlow v0.8.0 provides a neat, efficient, and easily extensible hardware abstraction layer called EP (Execution Provider) and defines a collection of basic computing interfaces called Primitive, allowing kernels to be re-implemented on top of the Primitive interfaces.
- Added new debugging tool stacks: OneFlow-Profiler and AutoProf.
  - OneFlow-Profiler is a tool designed to collect performance information during framework execution. It can record the execution time of operators and system components, the allocation of memory and DRAM, and the corresponding inputs and parameters of operators. This information helps developers find the main sources of overhead in framework execution and implement targeted optimizations.
  - AutoProf is a framework designed to efficiently check the alignment between OneFlow APIs and PyTorch APIs. It can also automatically compare the performance of OneFlow APIs and PyTorch APIs.
- Significantly optimized exception handling in the OneFlow API and improved the error messages raised when APIs hit exceptions.
- Significantly optimized the OneFlow API documentation: the API documentation has been restructured by functionality. In addition to the general operator APIs, oneflow.nn.graph, oneflow.embedding, oneflow.autograd, and other OneFlow modules, together with their environment variables, are now explained in detail.
Backwards Incompatible Change
- The ZeRO configuration API in Graph has been redesigned, reducing configuration and learning costs for users. In addition, the latest ZeRO supports 2D mixed parallelism (model parallelism plus pipeline parallelism) and 3D parallelism. (#8036, #8404, #8464)
Outdated configuration method in OneFlow v0.7.0:
import oneflow as flow

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.set_zero_redundancy_optimizer_mode("distributed_split")
        if zero_stage > 1:
            # stage 2
            flow.boxing.nccl.enable_use_compute_stream(True)
        if zero_stage > 2:
            # stage 3
            flow.boxing.nccl.disable_group_boxing_by_dst_parallel(True)

    def build(self, x):
        return self.linear(x)

graph = Graph()
New interface in OneFlow v0.8.0:
import oneflow as flow

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.enable_zero(stage=2)

    def build(self, x):
        return self.linear(x)

graph = Graph()
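For orientation, here is a minimal training sketch around the new interface; the 2-GPU placement, tensor shapes, optimizer, and launch command are illustrative assumptions rather than part of the release note:

```python
import oneflow as flow

# ZeRO only takes effect for global (distributed) execution, so the module
# and the input are made global across the participating ranks.
placement = flow.placement(type="cuda", ranks=[0, 1])   # assumed 2-GPU setup
sbp = flow.sbp.broadcast

class LinearTrainGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False).to_global(placement=placement, sbp=sbp)
        self.add_optimizer(flow.optim.SGD(self.linear.parameters(), lr=0.1))
        self.config.enable_zero(stage=2)  # new v0.8.0 interface

    def build(self, x):
        loss = self.linear(x).sum()
        loss.backward()
        return loss

graph = LinearTrainGraph()
x = flow.randn(4, 3).to_global(placement=placement, sbp=sbp)
loss = graph(x)  # one training step; launch with e.g. python3 -m oneflow.distributed.launch --nproc_per_node 2
```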
Deprecations
Python API
- The outdated parameter axis in oneflow.sbp.split() (still supported for compatibility) has been uniformly renamed to dim to denote the split dimension. (#8411)
v0.7.0
oneflow.sbp.split(axis=0)
v0.8.0
oneflow.sbp.split(dim=0)
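For context, a minimal sketch of where flow.sbp.split(dim=0) is typically used; the placement and shapes are illustrative assumptions:

```python
import oneflow as flow

placement = flow.placement(type="cuda", ranks=[0, 1])  # assumed 2-GPU placement
x = flow.randn(8, 4)  # a local tensor

# Split the tensor along dimension 0 across the two ranks;
# dim=0 replaces the deprecated axis=0 spelling.
x_global = x.to_global(placement=placement, sbp=flow.sbp.split(dim=0))
```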
- The outdated pipeline parallelism configuration method self.module_layer_0.config.stage_id = 0 is no longer recommended. A new pipeline parallelism API, config.set_stage, has been added; it improves pipeline parallelism performance and removes the need to call input_tensor.to_global(placement=this_stage_placement) for every module input tensor at each stage. (#8442)
v0.7.0
import oneflow as flow

B = [flow.sbp.broadcast]
P_0 = flow.placement(type="cuda", ranks=[0, 1])
P_1 = flow.placement(type="cuda", ranks=[2, 3])

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # Set each module's stage id so that the graph prepares the right number of buffers in the pipeline.
        self.m_stage0.config.stage_id = 0
        self.m_stage1.config.stage_id = 1
        self.config.set_gradient_accumulation_steps(4)

    def build(self, x):
        x = x.to_global(placement=P_0, sbp=B)
        y = self.m_stage0(x)
        # Move the tensor between pipeline stages.
        y = y.to_global(placement=P_1, sbp=B)
        z = self.m_stage1(y)
        return z
v0.8.0
class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # set_stage(stage_id, placement)
        # Stage ids are numbered from 0 and increase by 1.
        # The placement applies to all tensors of this module.
        self.m_stage0.config.set_stage(stage_id=0, placement=P_0)
        self.m_stage1.config.set_stage(stage_id=1, placement=P_1)
        self.config.set_gradient_accumulation_steps(4)

    def build(self, x):
        # tensor.to_global(placement) is applied automatically to every input tensor of each staged module,
        # so there is no need to call to_global() inside or outside the module forward function.
        y = self.m_stage0(x)
        z = self.m_stage1(y)
        return z
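A hedged sketch of driving the staged graph above, reusing the placements P_0/P_1 and sbp B defined in the v0.7.0 snippet; the batch size is an illustrative assumption (with 4 gradient accumulation steps the batch is split into 4 micro-batches):

```python
graph = Graph()

# A global batch of 16 rows is split into 4 micro-batches (gradient accumulation steps = 4).
x = flow.randn(16, 8).to_global(placement=P_0, sbp=B)
z = graph(x)  # runs stage 0 on ranks [0, 1] and stage 1 on ranks [2, 3]
```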
New Features
Graph
- Added new interfaces oneflow.env.init_rdma and oneflow.env.rdma_is_initialized to delay turning on RDMA, thus accelerating network communication across multiple devices. (Note: avoid calling fork() after RDMA has been turned on; for example, a DataLoader with num_workers > 1 should be created before init_rdma.) (#8415)
- Graph provides a new algorithm optimization interface, graph.config.enable_straighten_algorithm, to optimize the execution order in the computation graph and maximize the overlap between data transfer and computation. With this interface, data transfer speed rises by 0.6% in data parallelism mode and by 6% in model parallelism mode. (#8347, #8483, #8495)
- Optimized the implementation of clip grad in Graph to support clip_grad_max_norm > 1.0 and provided a configurable clip_grad_norm_type, which previously could only be set to 2 but can now be set to +/- inf, +/- 1, +/- 2, +/- 3, and larger p-norm values. See the reference from here. (#7548)
- Global tensors in Graph support the tensor.set_item operation for invariable ops, for example mask[:, :len_keep] = 0. (#7751)
- Graph exported the build_graph and compile_and_init_runtime interfaces, which allow user-defined passes to be compiled after the graph is built, rewriting and optimizing the graph. The two interfaces also let Graph restore an external graph (job). (#8168)
- Added the RegisterJobPass interface to support rewriting the graph with self-defined external job passes. (#8370)
- oneflow.boxing.nccl.enable_use_compute_stream(True) received optimized support for NCCL logical kernels.
- Added the efficient fused kernel oneflow.nn.FusedMLP, which is controlled by export ONEFLOW_FUNCTOR_DISABLE_FUSED_MLP=0. A configuration sketch follows this list. (#7391, #8165, #8217, #8413)
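A hedged configuration sketch that combines several of the Graph knobs listed above; whether enable_straighten_algorithm takes a boolean argument, and the model/optimizer names, are assumptions, so check the API docs for exact signatures:

```python
import os
import oneflow as flow

# Assumed: keep the FusedMLP functor path enabled (0 = do not disable).
os.environ["ONEFLOW_FUNCTOR_DISABLE_FUSED_MLP"] = "0"

class TrainGraph(flow.nn.Graph):
    def __init__(self, model, optimizer):
        super().__init__()
        self.model = model                                  # any flow.nn.Module
        self.add_optimizer(optimizer)
        self.config.enable_zero(stage=2)                    # ZeRO stage 2
        self.config.enable_straighten_algorithm(True)       # assumed boolean-flag form
        flow.boxing.nccl.enable_use_compute_stream(True)    # NCCL logical kernels on the compute stream

    def build(self, x):
        loss = self.model(x).sum()
        loss.backward()
        return loss
```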
Debug
- Graph.debug offers a new parameter, max_stack_depth (default = 2), to specify the maximum depth of the Python stack recorded for each op in Graph, making it convenient to locate the Python context of each op. (#8028)
- In addition to printing the input/output/variable info of modules in Graph, Graph.debug now also prints info about operators constructed in module forward. (#8135)
- Setting export ONEFLOW_DEBUG_MODE=true and export GLOG_v=3 prints the full memory log, which contains multi-level MemBlock info on each device (Total Memory -> Chunk -> MemBlock), Blocks that own exclusive memory, Eager Variables, and other information. In addition, a lifecycle label was added to Regst to analyze each tensor's memory lifecycle.
- LightPlan provides a more simplified way to display the Actor Graph, cutting the cost of debugging based on the Plan. When ONEFLOW_DEBUG_MODE=true, a series of light plan files, one per rank of the Graph, is generated under the log/local_rank_0/machine/ directory; each contains the simplified actor sub-graph of that rank, and the filename is GraphName_rank_i_light_plan. (#8396)
- The print(graph) method can display the logical graph module by module, making the debugging of graph construction more efficient. A debug-mode sketch follows this list. (#8131)
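A hedged sketch of turning on these debugging aids; the verbosity argument to debug() is an assumption, and the keyword name max_stack_depth is taken from the note above and may differ in the released API:

```python
import os
import oneflow as flow

os.environ["ONEFLOW_DEBUG_MODE"] = "true"  # also enables the light plan / memory logs described above

class MyGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)

    def build(self, x):
        return self.linear(x)

graph = MyGraph()
graph.debug(1, max_stack_depth=2)  # assumed: first argument is the verbose level
y = graph(flow.randn(4, 3))
print(graph)                       # display the logical graph module by module
```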
Eager
- Supported passing extra parameters when an Optimizer ParamGroup is built, to meet special needs of LrScheduler and similar components. (#7753)
  param_groups = [{"params": [model.parameters()], "excess_param": ...}]
  optim = optim.Adam(param_groups, lr=0.1)
- Added the oneflow.cuda.current_device interface to return the device index of the current rank. (#7856)
- Added the oneflow.utils.from_torch interface to convert a PyTorch Tensor into a OneFlow Tensor. (#7851)
- Added the oneflow.utils.to_torch interface to convert a OneFlow Tensor into a PyTorch Tensor. (#7851)
- Added the oneflow.cuda.empty_cache interface to manually release memory. (#8482)
- Added the oneflow.roc_auc_score interface on CPU, which is equivalent to sklearn.metrics.roc_auc_score. A conversion sketch follows this list. (#7951)
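A minimal sketch of round-tripping a tensor between PyTorch and OneFlow with the new helpers (module paths follow the note above; shapes are illustrative):

```python
import torch
import oneflow as flow

t = torch.randn(2, 3)
of_t = flow.utils.from_torch(t)    # PyTorch tensor -> OneFlow tensor
back = flow.utils.to_torch(of_t)   # OneFlow tensor -> PyTorch tensor
print(of_t.shape, back.shape)
```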
Tensor
- Provided the Tensor.contiguous_ interface as the in-place version of the contiguous operation. (#8275)
- Added the Tensor.local_to_global and Tensor.global_to_global interfaces, which implement different default meta-consistency checks. (#8027)
- Global Tensor's Slice/SliceUpdate supports all nd_sbp inputs, and SliceUpdate fully supports in-place operation and backpropagation. A hedged sketch follows this list. (#8313, #8337, #8344, #8416)
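A hedged sketch of the new conversion interfaces; the keyword arguments and the described default meta checks are assumptions based on the note above:

```python
import oneflow as flow

placement = flow.placement(type="cuda", ranks=[0, 1])  # assumed 2-GPU placement
x = flow.randn(4, 4)  # a local tensor held on each rank

# local -> global: by default the meta info (shape/dtype) of all ranks is checked for consistency.
g = x.local_to_global(placement=placement, sbp=flow.sbp.split(dim=0))

# global -> global: redistributes an existing global tensor; by default the meta check is skipped.
g2 = g.global_to_global(placement=placement, sbp=flow.sbp.broadcast)
```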
Global Boxing
- Eager Global Tensor supports balanced-splitter nd-sbp eager boxing. (#7768)
- Supported executing Eager Slice Boxing on arbitrary devices, including non-CPU devices and non-CUDA devices. (#8180)
OneEmbedding
For better recommendations, modern recommendation systems always rely on huge Embedding tables. Besides, frequent iterations of user data require model training to be fast enough.
OneEmbedding is a component designed for large-scale recommendation systems, and it's efficient, extensible, and highly flexible. The following are its advantages:
- Hierarchical storage and dynamic capacity expansion: users can expand the capacity of the Embedding at a much lower cost.
- Mixed parallelism strategies: the model can easily be extended to multi-machine, multi-GPU training.
- Embedding quantization for better communication: in parallel scenarios, communication data can be quantized to reduce communication volume and accelerate training.
- Efficient data pipeline: the parts of the model with no data dependency can be executed in advance, overlapping with other operations.
- Automatic mixed-precision training: data can be computed in FP16 to reduce memory usage and accelerate training while keeping high model convergence precision.
- A collection of efficient CUDA ops for common operations in recommendation systems.
- Flexible model building.
See OneEmbedding API documentation from here.
PyTorch Compatibility
A collection of new functionalities and interfaces compatible with PyTorch 1.10.0 has been added.
Tensor
- Added the Tensor.pin_memory functionality, which supports placing a tensor in pinned memory when it is created. (#8073)
- Added the ~Tensor (invert) method to perform logical NOT on each element of a bool tensor. (#7899)
- Added the Tensor.log2 method to compute log2 of each element. (#7906)
- Added the Tensor.new_zeros method to generate a new tensor filled with zeros. (#7937)
- Added the oneflow.as_tensor interface to convert input data into a tensor that shares data. (#7855)
- Added the Tensor.__array__ method, so np.array can take a OneFlow tensor as input and construct an np.ndarray object. (#7970)
- Added the Tensor.new_tensor method to copy the input data into a new tensor. (#7973)
- Added the Tensor.half method, which is equivalent to tensor.to(oneflow.float16). (#7971)
- Added the Tensor.byte method to generate a new uint8 tensor; tensor.byte() is equivalent to tensor.to(oneflow.uint8). (#8053)
- Added the Tensor.view_as and Tensor.new_empty methods. (#8077)
- Added the Tensor.type method to implement the corresponding casts, and added the oneflow(.cuda).{Byte, Char, Short, Int, Long, Half, Float, Double}Tensor objects. (#8129)
- Added the Tensor.dot method to compute the dot product of two 1D tensors; it is equivalent to oneflow.dot. (#8520)
- Added the oneflow.nn.init.orthogonal_ interface to initialize tensors. A short usage sketch follows this list. (#8009)
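A small sketch exercising a few of the newly aligned tensor methods (values and shapes are illustrative):

```python
import numpy as np
import oneflow as flow

x = flow.tensor([[1.0, 4.0], [2.0, 8.0]])
mask = flow.tensor([True, False])

print(~mask)                 # logical NOT on a bool tensor
print(x.log2())              # element-wise log2
print(x.half().dtype)        # equivalent to x.to(flow.float16)
print(x.new_zeros((2, 3)))   # new zero-filled tensor with the same dtype/device as x
arr = np.array(x)            # Tensor.__array__ lets np.array consume a OneFlow tensor

w = flow.empty(3, 5)
flow.nn.init.orthogonal_(w)  # in-place orthogonal initialization
```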
Operators
- Added the oneflow.nn.Softshrink op. (#7826)
- Added the oneflow.nn.Threshold op. (#7875)
- Added the oneflow.nn.Hardshrink activation function. (#7887)
- Added the oneflow.isnan and oneflow.isinf interfaces to determine whether each element of a tensor is nan or inf. (#7943)
- The oneflow.nn.functional.* interfaces support passing numpy scalar parameters. (#7935)
- Added the oneflow.nn.functional.cosine_similarity op to calculate the cosine similarity of two tensors. (#8119)
- Added the oneflow.nn.functional.conv_transpose1d, oneflow.nn.functional.conv_transpose2d, and oneflow.nn.functional.conv_transpose3d ops. (#7991)
- Added the oneflow.unbind interface to return a tuple of all slices along a given dimension. (#7730)
- Added the oneflow.swapdims interface to swap two specified dimensions; it is equivalent to NumPy's swapaxes. (#7659)
- Added the oneflow.addcmul op to execute the element-wise composite function out = input + value × tensor1 × tensor2. (#7282)
- Added the oneflow.searchsorted op. (#7949)
- Added the oneflow.mm op. (#8440)
- Added the oneflow.tensordot interface and offered a collection of equivalent transformation operations. (#7968)
- Added the oneflow.repeat_interleave op to repeat the elements of a tensor; it is equivalent to numpy.repeat. (#8324)
- Added the oneflow.amax and Tensor.amax methods. (#7996)
- Added the oneflow.median and Tensor.median methods. (#8069)
- Added the oneflow.normal method and fixed the Tensor.normal method. (#7956)
- Added the oneflow.amin and Tensor.amin methods. (#8042)
- Added the oneflow.mv op and Tensor.mv method. A short sketch of several of these ops follows this list. (#8445)
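A brief sketch of a few of the new operators, which follow their PyTorch/NumPy counterparts (values are illustrative):

```python
import oneflow as flow

x = flow.arange(6).reshape(2, 3).float()

rows = flow.unbind(x, dim=0)                        # tuple of slices along dim 0
y = flow.swapdims(x, 0, 1)                          # swap two dimensions (like numpy.swapaxes)
z = flow.addcmul(x, x, x, value=0.5)                # x + 0.5 * x * x, element-wise
m = flow.mm(x, y)                                   # 2D matrix multiplication -> 2x2
v = flow.mv(m, flow.ones(2))                        # matrix-vector product
r = flow.repeat_interleave(flow.tensor([1, 2]), 3)  # tensor([1, 1, 1, 2, 2, 2])
```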
Random
- Added new interfaces oneflow.cuda.manual_seed, oneflow.cuda.manual_seed_all, oneflow.seed, oneflow.manual_seed, oneflow.initial_seed, oneflow.get_rng_state, and oneflow.set_rng_state, and improved the configuration of OneFlow random seed initialization. A seeding sketch follows below. (#7957)
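A hedged sketch of the new seeding interfaces, which mirror their PyTorch equivalents:

```python
import oneflow as flow

flow.manual_seed(42)            # seed the default generators
flow.cuda.manual_seed_all(42)   # explicitly seed every CUDA device

state = flow.get_rng_state()    # snapshot the default generator state
a = flow.randn(3)
flow.set_rng_state(state)       # restore the state ...
b = flow.randn(3)               # ... so b is drawn from the same state as a
print(a, b)                     # identical values
```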
AutoGrad
- Added new interfaces oneflow.set_grad_enabled and oneflow.enable_grad to enable or disable automatic gradient computation for parts of a graph. (#8016)
- Supported an upstream gradient dtype in the autograd backward operator that differs from the dtype of the input. (#8233, #8309)
- Supported running backward multiple times for backward operators that do not capture any tensor. A sketch of the grad-mode interfaces follows this list. (#8031)
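A small sketch of the grad-mode switches, which behave like their PyTorch counterparts:

```python
import oneflow as flow

x = flow.ones(3, requires_grad=True)

with flow.no_grad():                 # gradients disabled in this block
    with flow.enable_grad():         # re-enable gradients for the inner region
        y = (x * 2).sum()
y.backward()
print(x.grad)                        # tensor of 2s

with flow.set_grad_enabled(False):   # also usable as a context manager taking a bool
    z = x * 3
print(z.requires_grad)               # False
```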
CUDA
- Added the oneflow.cuda.set_device and oneflow.cuda.synchronize APIs, as sketched below. (#8322)
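A minimal sketch of the new CUDA helpers; the device index and shapes are illustrative:

```python
import oneflow as flow

if flow.cuda.is_available():
    flow.cuda.set_device(0)      # select the current CUDA device for this process
    x = flow.randn(1024, 1024, device="cuda")
    y = x @ x                    # asynchronous CUDA kernel launch
    flow.cuda.synchronize()      # wait for all kernels on the current device to finish
```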
RNN
- Refactored the RNN modules and migrated the Python-level layer splicing implementation to C++, greatly improving performance. Added RNNCell-related modules and modules aligned in functionality with torch.nn.utils.rnn (a usage sketch of the refactored LSTM follows this list):
  - Refactored modules: RNN, LSTM, and GRU
  - Added modules: RNNCell, LSTMCell, GRUCell, and oneflow.nn.utils.rnn
  - Supported and fixed local and global RNN unit tests, and completed the documentation.
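A hedged sketch of the refactored LSTM module, which follows the torch.nn.LSTM interface; shapes are illustrative:

```python
import oneflow as flow

# (seq_len, batch, input_size) -> (seq_len, batch, hidden_size), as in torch.nn.LSTM
lstm = flow.nn.LSTM(input_size=16, hidden_size=32, num_layers=2)
x = flow.randn(5, 3, 16)
output, (h_n, c_n) = lstm(x)
print(output.shape)  # (5, 3, 32)
```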
Device
Supported heterogeneous device types: To cope with the complexity of different hardware, OneFlow, following the dependency inversion principle of software engineering, introduces a hardware abstraction layer called Execution Provider (EP). The hardware abstraction layer consists of a series of interfaces abstracted from the capabilities the framework requires of hardware devices at runtime. Once this layer is in place, each module uses the underlying hardware by calling the interfaces provided by the abstraction layer rather than the original hardware interfaces, so it no longer needs to be concerned with the specific details of the underlying hardware. When a new hardware device is introduced, the abstraction interfaces remain unchanged, so all modules can adapt to the new device without any modification. Likewise, when adapting new hardware to the framework, there is no need to understand the framework's implementation details; it suffices to implement the series of interfaces according to the contract of the hardware abstraction interfaces and the actual capabilities of the device, and the hardware adaptation is complete.
Execution Provider has defined a collection of runtime interfaces: device registration interface, device management interface, queue management interface, event management interface, and memory management interface.
Primitive
In addition to the runtime interfaces, the Execution Provider defines a set of computing interfaces called Primitive, which describe the computations commonly used in a deep learning framework and thus simplify operator development during hardware adaptation. Compared with the runtime interfaces provided by the Execution Provider, the Primitive interfaces are looser and more flexible: all interfaces are mutually independent, and each one represents a specific computing capability offered by a particular hardware device. Like the runtime interfaces, the Primitive interfaces are abstracted close to the device side, so developers can carry out adaptation work without an in-depth understanding of OneFlow's internal mechanisms. Developers must implement all interfaces provided by the Execution Provider when adapting the runtime interfaces, but when adapting Primitive they can implement interfaces selectively according to the actual needs of the project.
- Added unit tests for ep::primitive basic functions. (#8099)
- Added ep::primitive::constant_pad, optimized its performance, removed the obsolete pad grad, and used pad as the inverse of pad. (#8152)
- Used the unary primitive interface to replace the original implementations in kernels. (#8270)
- Added the environment variable ONEFLOW_EP_CUDA_CUBLAS_WORKSPACE_SIZE_MB to configure the cuBLAS workspace size. (#8478)
- Scalar logical kernels support primitives. (#8531)
- Used primitives to implement the logical-not kernel. (#8544)
- Migrated all activation kernels to primitives. (#8300)
- The bias add kernel supports primitives. (#8512)
- Decoupled oneDNN from the ep::primitive CPU device and provided the environment variable ONEFLOW_ENABLE_ONEDNN_OPTS to let oneDNN accelerate the CPU primitive interfaces. (#8274)
Debug tools
- Saved the log of each rank independently to log/local_rank_{i} when multiple processes are launched by the launcher. (#7825)
- Optimized the display of OF_PROFILER_RANGE_GUARD in nsys. (#8121)
OneFlow-Profiler
OneFlow-Profiler is designed to collect various performance-related information during the execution flow of the framework. It can calculate the execution time of the operator or system components, the allocation of memory and DRAM, and can record the input and parameter information corresponding to the operator. This information can be used by developers to analyze which part brings the most overhead and implement some targeted optimizations.
- Added OneFlow-Profiler. (#8047)
- Profiled information about CUDA operators. (#8195)
- Profiled operator bandwidth information. (#8254)
- Added interfaces to collect bandwidth information and optimized the code. (#8332)
- Refined the Profiler. (#8332)
- Used Kineto and CUPTI to profile CUDA operator information. (#8417)
Auto-Test
- When a value check fails, the values of the input tensor and Parameter are automatically printed, and the pseudo-code segment of the output program is highlighted for debugging. (#8383)
AutoProf
AutoProf is a framework designed to test the performance of OneFlow operators against PyTorch operators. It can automatically test operator performance and print a comparison table under different CPU thread counts and on GPU. It has already been applied in the development of some existing operators and all new operators.
- Added AutoProf, an automatic operator speed comparison framework that runs ops to test: (#8207)
  - the speed of OneFlow and PyTorch;
  - the speed of CPU/GPU kernels under different numbers of threads;
  - total end-to-end time with CPU kernels.
- Optimized the display of AutoProf to save testing time. (#8303)
- Supported API tests without actual kernel execution, in which case the measured time is end-to-end. (#8320)
- Supported measuring kernel bandwidth with AutoProf. (#8367)
IR
- Used a Cast-elimination pass. (#7837)
- Used MLIR to perform constant folding and fused the Conv + BN graph optimization. (#7799)
- Optimized constant folding in the OneFlow C++ API. (#8124)
- Provided fault-tolerance checking for parsed modules. (#8299)
- Fixed a bug in the constant folding unit test. (#8340)
- Supported IREE. (#8249)
- Added oneflow_iree(python) to CI. (#8431)
- Removed redundant output_lbns in IR. (#8409)
- Provided a conversion marker for Variable -> constant. (#8412)
- Removed hardcoded properties in IR. (#8420)
- Implemented the AutoNHWC pass and provided the environment variable ONEFLOW_MLIR_PREFER_NHWC, which automatically converts common networks' data formats to channels-last and brings a noticeable speedup on NVIDIA GPUs that support FP16. A usage sketch follows this list. (#7890)
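A hedged sketch of enabling the AutoNHWC pass from Python; setting the variable before OneFlow is imported is an assumption (it can equally be exported in the shell):

```python
import os
os.environ["ONEFLOW_MLIR_PREFER_NHWC"] = "1"  # ask the MLIR passes to prefer channels-last layouts

import oneflow as flow
# ... build and run a Graph as usual; convolutions may be rewritten to NHWC where profitable.
```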
Performance
Graph
-
Optimized the speed and memory of GPT and BERT under 3-D parallelism:
-
Performance optimization:
fused_scale_mask_softmax
operator supported broadcast input. Optimized the kernel implementation and performance of softmax for a specific column count (1024). Completed the previously incomplete GetSbp list of the fused_scale_mask_softmax
reverse operator. (#8321) -
Communication optimization: Optimized the communication cost of SBP cost under
B->S
,B->B
,B->P
. (#8378) -
Interface optimization: Optimized the inefficient edge connection problem caused by the misalignment of stage id and to_global sequence dependency when using pipeline stage. (#8442)
-
Communication optimization:
nccl_use_compute_stream
supported more comprehensive sbp conversions likeP -> S(i)
. (#8361) -
Communication optimization: Parallel use of RDMA communication. (#8415)
-
Memory optimization: Eliminated the randomness of the memory-reuse algorithm, so that the memory-reuse result of each rank is consistent when the subgraphs are the same, avoiding bad cases. (#8441)
-
Memory optimization: Removed the extra buffer problem of Stage 0 CPU copy under Pipeline parallelism. (#8484)
-
Memory optimization: Under Checkpointing and Pipeline, the input identity of the module was de-duplicated to reduce additional Checkpointing tensor, and added the block name prefix of the module to the identity. (#8509)
-
Combination optimization: ZeRO-DP can now be used together with pipeline parallelism and 3D parallelism. (#8464)
- Memory optimization: Removed extra identity tensor in ZeRO optimization. (#8407)
-
-
Provided new environment variable optimization switches:
ONEFLOW_ENABLE_MULTI_TENSOR_MODEL_UPDATE
andONEFLOW_FUSE_MODEL_UPDATE_CAST
. In the case of AMP, they supported the fusion of the Optimizer model update kernel and the next round of forward cast operators. (#8373)
Eager
-
Enabled
export ONEFLOW_EAGER_LOCAL_TO_GLOBAL_BALANCED_OVERRIDE=true
to accelerate Eager Global execution by skipping the synchronization of Global Tensor meta information across ranks (use it only when you are confident that your code executes symmetrically, i.e., SPMD). (#7981) This environment variable indicates whether the input data has the same shape on every rank when local_to_global
is executed. If it is set to true, there is no need to synchronize the shape of each rank, and the logical shape is calculated locally. -
Used python c api to replace pybind11 to optimize the calling speed of tensor and functional.
-
Performance optimization: Let vm worker threads concentrate on computing tasks, and decoupled memory tasks from computing tasks. (#7976)
-
Optimized the speed of operations in DataLoader, including
MakeLocalTensorFromData
, which is 20% faster under swin-T dataloader. (#8066)
Operators & Tensor
-
Optimized global
sparse_softmax_cross_entropy
kernel. (#7298) -
Optimized and sped up CPU
permute
kernel with OneDNN. (#7872) -
Optimized and sped up CPU
softmax
kernel with OneDNN. (#8071 , #8075) -
Optimized the memory and speed required for the reverse calculation of the pooling kernel. (#7980)
-
Optimized Slice and Tensor getitem operations based on View to improve the speed of dataloader. (#8148, #8211, #8243)
-
Optimized the reverse composition logic of
flip
andcumsum
, and removed some grad operators. When testing gradient diffs, used random-value tests to increase test robustness. (#8155) -
Optimized the memory usage of the
NormalizationAddReluGrad
operator and added versions that do not require addend_diff. (#8213) -
Optimized and sped up the implementation of
tensor.reshape
andtensor.reshape_as
from python implementation to c++ implementation. (#8304) -
Converted
tensor.view
,tensor.view_as
,tensor.permute
,tensor.transpose
,tensor.contiguous_
from python implementation to c++ implementation. (#8317) -
Greatly optimized the performance of
index_select
andrepeat_interleave
by using gather to replace dim gather. (#8360) -
Optimized and removed temporary memory in cumprod cpu grad kernel. (#8369)
-
The
embedding
operator supported AMP, improved performance on the normal path, and fixed an out-of-bounds memory access in the gather CPU kernel. (#8374) -
Optimized the performance of
Tensor.fill_
. (#8283) -
Greatly optimized the performance of the broadcast element-wise binary family operators in reverse calculation. (#8339)
-
Added fusion operator BinaryCrossEntropyWithLogitsReduceMean. (#8476)
-
Added high-performance matrix multiplication Fused kernel based on cublasLt. (#8462, #8222, #8063)
Primitive
- Lowered the elementwise.cuh template's requirement for pointer alignment.
Improvements
Graph
-
Exported oneflow env to python and used python's objects to manage its lifecycle. (#7792)
-
Used Python's reference counting to control the life cycle of Graph and constructed strict and rich destruction test cases. (#7857)
-
Supported recycling independent threads that can no longer be reused when Graph is destructed. (#7862)
-
Changed the basic configuration of resource from one-time static effect to real-time effect. (#8444)
-
Consolidated the nccl_comm dynamically created by the Graph NCCL logical kernel into the runtime for initial creation to avoid the deadlock caused by the inconsistency between the creation order of each rank and the eager nccl comm creation order. (#8263)
-
Refactor optimization: Merged
nn.graph.util.IONode
,nn.graph.util.IONodeType
into IOArgs. (#8272) -
Refactor optimization: Renamed the global singleton Global object to the Singleton object. (#8490)
-
Refactor optimization: Removed gpu_device_num (#8516)
-
Refactor optimization: Removed outdated AvailableMemDesc concepts. (#8145)
-
Refactor optimization: Removed outdated Model IO Kernel logic. (#8151)
-
Refactor optimization: Replaced GpuDeviceNum with the actual number of devices to avoid coupling with specific device types. (#8166)
Eager
-
A C++ interface is now available to manually trigger allocator GC on each stream (useful for ZeRO). (https://github.com/Oneflow-Inc/oneflow/pull/8452)
-
The execution of Eager VirtualMachine instruction is based on the execution of EP. (#7923)
-
Optimized and removed all redundant interfaces of
Get(Ptr)OrThrow
. (#7812) -
Added the validity check of
flow.save(global_dst_rank)
. (#7964) -
Supported the backward function node to run multiple times if it does not capture any tensor. (#8031)
-
Added the
ThreadLocalCached
decorator to clear the cache in time to alleviate increasing memory. (#7858) -
Added C++14-compatible std::inclusive_scan / std::exclusive_scan implementations. (#8128)
-
Packaged the parameters required by the eager opkernel and pass them in each thread to solve some thread-unsafe problems. (#7617)
-
Eager Stream supports kernel computation on pinned memory. (#8486)
-
Introduced a tool class for dim range check to replace simplified Functor's various checking logic for dimensions. (#8382)
-
Refactoring and optimization: removed the Blob object in EagerBlobObject, which leads to redundant TensorView instructions. At the same time, in order to support ShapeView efficiently, the elem_cnt attribute has also been removed. (#7895)
-
Refactoring and optimization: extracted the algorithm used by BinAllocator to share dynamic memory pools
-
Refactoring and optimization:
VectorAt
andMapAt
functions uniformly use reference to pass parameters to solve the mixed use of reference interface and pointer interface. (#8191) -
Refactoring and optimization: removed the cfg application on C++. (#8158)
-
Refactoring and optimization: removed the outdated code related to RemoteBlob in Single-Client. (#8228)
-
Refactoring and optimization: merged duplicate logic in eager boxing ccl and nccl boxing expr. (#7930)
-
Refactoring and optimization: removed cfg on Python and reduced the number of symbols to optimize the link speed of compilation.
-
Refactoring and optimization: merged
symbol::IdCache
andsymbol::Storage
. (#8331) -
Refactoring and optimization: introduced
llvm::SmallVector
and usedoneflow::small_vector
instead offixed_vector
. Besides, we have optimized the implementation and usage of Shape and Stride. (#8365 , #8402) -
Refactoring and optimization: refactored ShapeView and Shape to eliminate duplication and inconsistencies. (#8422)
-
Refactoring and optimization: eager VirtualMachine has decoupled InstructionType's dependency on StreamType. (#7607)
-
Refactoring and optimization: removed the InstructionMsg class and merged all its functions and fields into the Instruction class. (#7623)
Operators & Tensor
-
Stride support:
-
View support and optimization:
-
When defining an op, it is now possible to declare whether non-contiguous input tensors are supported. Besides, we now support
transpose
,permute
,narrow
,expand
,expand_as
,split
,chunk
,unfold_tensor
,movedim
,as_strided
,select
,swapaxes
,T
,t
,hsplit
,vsplit
,tensor_split
as non-contiguous view ops. (#7813) -
Tensor slice used view operations by default.(https://github.com/Oneflow-Inc/oneflow/pull/8302)
-
-
Automatically generated version status (Feature Stage) for OneFlow's API. (#7945)
-
Optimized CUDA memset to
cudaMemsetAsync
(https://github.com/Oneflow-Inc/oneflow/pull/7763) -
LeakyReLU
supported inplace optimization. (#8060) -
Added the following parameters to
nn.Embedding
interface:padding_idx
,max_norm
,norm_type
,scale_grad_by_freq
. (#8110) -
Aligned PyTorch's
max_pool_1d
,max_pool_2d
,max_pool_3d
,avg_pool_1d
,avg_pool_2d
,avg_pool_3d
, and distinguish old pooling kernel aligned with TensorFlow. (#8111) -
VectorAt supported passing in non-const references:
JUST(VectorAt(vec, 1)) = 5;
. (#8013) -
Reduced the uncommon kernel template specializations of layer norm. (#8209)
-
Modified the logic of
Tensor.numpy
to avoid extra memory growth when saving the model. (#8449) -
Tensor str supported printing nd sbp. (#8458)
-
Slice supported SBP inference (S->P), and the semi-automatically deduced sbp is able to select the expected sbp among the reducible nd_sbp. (#8536)
-
When printing a non-CPU, non-CUDA tensor, it is first copied to the CPU and then printed. (#8548)
-
Refactoring and optimization: decoupling user kernel and device tag. (#8529)
-
Refactoring and optimization: a series of kernels (
squeeze
,reshape_like
,flatten
,expand_dims
,reshape
,amp_white_identity
,identity
,identity_buffer
,parallel_cast
,hierarchical_parallel_cast
,hierarchical_parallel_cast_like
) were refactored to CopyDataContentKernel #8537 -
Refactoring and optimization: removed obsolete
constant_pad1d
,constant_pad2d
,constant_pad3d
kernel. (#8113) -
Refactoring and optimization: removed obsolete old lazy
upsample
kernel implementation.(#8188) -
Refactoring and optimization: removed obsolete message in shape proto and used sequential to represent stride. (#8220)
-
Refactoring and optimization: removed the obsolete multiply kernel, which was covered by
broadcast_mul
. (#8359) -
Refactoring and optimization: Renamed the shape in UserOp/Kernel to shape_view interface. (#8433)
-
Refactoring and optimization: removed oneflow gemm. (#8499)
-
Optimized the Maybe return type of such interfaces as Scalar.As(). (#8348)
Device
-
Code refactoring
ep::CpuDevice
(#7911) -
Code refactoring: removed hard-coded special decision for device type like "cpu", "cuda" from system code. (#8201)
-
Removed all dnn-related interfaces from the old version of KernelUtil (Primitive will be used to replace those interfaces). (#8141)
-
Removed all interfaces related to mathematical calculation in the old version of KernelUtil (Primitive will be used to replace those interfaces). (#8157)
-
Removed incomplete special decision for 'cuda 'device type in scope util. (#8173)
-
Achieved delayed capture of CUDA Graph(#8474)
-
Code refactoring: removed cuda_event. (#8493)
-
Code refactoring: removed useless WITH_CUDA macro. (#8562)
Tests
Eager Global Module Tests:
In 0.8.0, all kernels can now handle global tensors in distributed settings, and many known SBP-related bugs have been fixed. Global tensors work efficiently and correctly at the kernel level: no matter how the distributed topology changes, the same algorithm logic efficiently produces mathematically consistent results, which greatly reduces the effort of verifying correctness in complex, diverse, and asymmetric distributed parallel training.
EP::Primitive
Completed unit tests of the Primitive log_softmax, softmax, copynd, Memset, Memcpy, matmul, batch_matmul, add, binary, unary, fill, etc. (#8132, #8139, #8137, #8109, #8143, #8108, #8154, #8118, #8291)
Exception
Improved exception error handling.
-
Added
reshape
exception handling. (#7847) -
Improved the error message of module when the input information does not match. (#7918)
-
Added the
MAYBE_NEED_ERROR_MSG_CHECK
environment variable to check whether the CHECK function of Maybe contains oneflow:: Error message. It is used to prompt developers to add error prompt message. (#7955) -
Improved the exception error message of
gather
op.(#7979) -
Improved
LayerNorm
error message. (#8090) -
Optimized the error message when Eager and Graph encounter multiple inconsistent input placement in op. (#8054)
-
Improved the error message checking in activation-related kernel processing logic.(#8080)
-
Improved the error message in
tensor.to_global
andtensor.to_local
. (#8067) -
Improved the exception error message in the
dot
kernel. (#8051) -
Rewrote the exception check in
batch_matmul
kernel. (#8186) -
Fixed the problem of exception error checking when Python parses arg. (#8205)
-
Improved the exception error checking logic of all array functor. (#8116)
-
Improved the exception error checking logic of all binary functor. (#8161)
-
Improved the exception error reporting logic in nn grad functor. (#8210)
-
Added error message when Graph.build is not reloaded. (#8250)
-
Added TypeError type and device-related error message. (#8057)
-
Improved the error message of Eager SliceBoxing. (#8232)
-
Improved the error message of broadcast op.
-
Improved the error message of Eager Boxing when it is at runtime. (#7926)
-
Improved the error message of Tensor index. (#8234)
-
Improved the error message in nn.functor. (#7910)
-
Added check for Physical Shape when Graph compiles exec_graph. (#8002)
-
Added default error message for CUDA check. (#8427)
-
Added similar error-checking information to the add_n calculation. (#8495)
-
Improved the error message of arg sort. (#8513)
-
Improved the error message of bias add. (#8524)
-
Improved the error message in autograd function. (#8496)
-
Improved the error message of batch gather. (#8533)
-
Improved the error message prompt of defense code in autograd. (#8525 , #8541)
Build
-
Supported CUDA 11.5 and 11.6. (#7852, #8423)
-
Pinned the version of click to 8.0.0. (#7967)
-
Updated nccl version to 2.12.10. (#7822)
-
Aligned the default PyTorch version to 1.10.0. (#7019)
-
Updated tvm oneflow frontend dependencies. (#8048)
-
Updated the version of LLVM/MLIR to support IREE. (#8068 , #8461)
-
Pinned the protobuf version to between 3.9.2 and 4.0. (#8198)
-
Removed the cfg tool in cmake. (#8218)
-
The CMAKE_INTERPROCEDURAL_OPTIMIZATION option is enabled by default. (#8237)
-
Removed the XRT part in the OneFlow source code, and the OneFlow-XRT will be used as a third-party plugin for oneflow. (#8273 ,#8288)
- read more: https://github.com/Oneflow-Inc/oneflow-xrt
-
Changed Liboneflow to dynamic library. (#8312)
-
Updated the version of clang-tidy to 14.0.4. Supports the following syntax now: NOLINT, NOLINTNEXTLINE, NOLINTBEGIN & NOLINTEND. (#8306)
-
Removed
EXTERNAL_INCLUDE_DIRS
, only builds with target. (#8421) -
Removed obsolete linkages in cmake. (#8426)
CI
Improved the running speed and stability of CI.
-
Supported CI to automatically upload built docs.(#7894 #7917)
-
Added CI test for IREE. (#8419)
-
Printed the pip package in the container used to test in order to query version information easily. (#7952)
-
Optimized the memory used by AutoTest. (#7988)
-
Adjusted the threshold of benchmark. (#8043)
-
Adjusted the timeout threshold. (#8103)
-
Optimized the warning output related to
__del__
in CI. (#8049) -
Optimized the interval of gc to improve the test speed. (#8138)
-
Optimized the use of overly large tensors in CI unit tests so that slow GC does not drag down CI speed. (#8177)
-
Optimized the number of CI build to improve the speed of build. (#8229)
-
Optimized the CI workflow: all workflows stop when a job fails. (#8255)
-
Increased maximum parallelism 5 -> 10. (#8259)
-
Strict CI timeout-minutes. (#8266)
-
Supported optional multi-machine testing via the
need-test-distributed
tag. (#8372) -
Tried to use a distributed test cache when testing on multiple machines. (https://github.com/Oneflow-Inc/oneflow/pull/8387/files)
-
Optimized the test time of global test. (#8468)
-
Optimized the execution time of test_math_ops, test_loss, test_activation, test_tensor_part1, test_tensor_part2 and other eager test. (#8494)
-
Optimized test_convtranspose, test_einsum, test_sqrt_square_sum in expensive eager test. (#8504)
Models
-
Fixed the speed test for Swin-Transformer. (#7840)
-
Added compatibility tests for
conv_mixer
,densenet
,ghostnet
,googlenet
,inception_v3
,mnasnet
,rexnet
,rexnet_lite
,res2net
,shufflenet_v2
,squeezenet
,convnext
,crossformer
,efficientnet
,levit
,mlp_mixer
,poolformer
,pvt
,res_mlp
,uniformer
,swin_transformer
,senet
and other models. Fixed compatibility issues such as the conv2d module's padding parameter not supporting strings, the parameter list of functional.layer_norm not being aligned, and meshgrid not supporting list[tensor] input; added an interface for tensor.reshape_as. (#7942) -
Fixed the bug of Swin-Transformer dataloader. (#8037)
-
Added single-node 4-GPU tests for models such as InsightFace in the oneflow_face repository. (#8130)
Bug fixes
Graph
-
Fixed the bug of nccl deadlock caused by CUDA kernel asynchronous launch limit for nccl logical kernel in 3-D parallelism. (#7924)
-
Fixed cycle import of scope and session. (#7993)
-
Used log_softmax + nll to make the sparse_softmax_cross_entropy calculation subgraph more numerically stable. (#7987)
-
Fixed the bug that B2P boxing misses TaskEdge lbi. (#8052)
-
Fixed a compilation failure caused by an eager free tensor not being in nn.Graph's job. (#8114)
-
Fixed the possible problem of SegmentFault caused by BlobDesc. (#8252)
-
Solved the bug of circular import in python 3.6. (#8268)
-
Solved the problem that Graph's input and parameter/buffer tensors fail to handle non-contiguous tensors.(#8281)
-
Solved the potential deadlock caused by inconsistent partial order execution of multiple ranks in 3-D parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/8226)
-
Fixed the bug that Ibverbs failed to start the environment due to incorrect mtu value in special network environment. (#8451)
-
Solved the potential deadlock caused by the partial-order execution of each rank when the subsequent subgraph of GradAcc is inserted into the NCCL logical op; at the same time, traversed the subsequent subgraphs of GradAcc more comprehensively to fix missing NCCL ops. (#8459)
-
Fixed the bug that NCCL logical kernels do not support the bool type. (#8455)
-
Fixed the bug of tensor detach and clone in Graph. (#8498)
Eager
-
Aligned
DataLoader.__next__
interface (#7835) -
Fixed backtracking failure when calculating higher-order derivatives, which is caused by the capturing of forward detached tensors via
AutoGrad
-
Fixed inadequate execution of the semantics of sync by Barrier Instruction (#7702)
-
Fixed memory leak caused by imperfect management of VM instruction count
-
Fixed
getitem
when tensor device id is not in the current rank -
Fixed
global norm
error on gradient calculation for various placements when calling clip grad in pipeline parallelism in eager global mode (#7879) -
Fixed possible int32 arithmetic overflow caused by
Shape.elem_cnt
(#8178) -
Fixed incorrect results produced by
Module.to_global
when introducing parameters (#8187) -
Fixed extra GPU memory usage in
flow.load
andmodule.load_state_dict
(#8301) -
Fixed extra GPU memory usage when Optimizer loads models (#8310)
-
Fixed the error occurs when loading models via
flow.load
in multi nodes (#8314) -
Fixed instability of eager caused by the introduction of callback thread (#8193)
-
Fixed
tensor.from_numpy
interface to avoid memory leak when the input of numpy is non-contiguous tensor (#8391) -
Fixed stack overflow when destructing the deep backward computational graph after recursion (#8056)
Operators & Tensor
Global Tensor
-
Fixed global SBP inference of
unfold
(#7883) -
Fixed global SBP inference of
grid_sample
(#7881) -
Fixed incorrect pass of values in slice boxing kernel in certain cases (#7893)
-
Fixed eager global inplace (#7903)
-
Fixed SBP inference of
upsample
op (#7884) -
Fixed SBP inference of
ScatterAdd
,ScatterUpdate
, andScatterScalarUpdate
(#7807) -
Fixed backward memory error of
partial_fc
with Global Tensor (#8041) -
Added support for S0 in
randperm
and fixed equal local tensors across all ranks in random op in Split (#7571) -
Fixed tensor getitem index error in global (#8153)
-
Fixed SBP inference of
RoiAlign
and added global unit test (#7794) -
Fixed SBP inference of
stack
op (#8181) -
Fixed random initialization in median under CPU global (#8245)
-
Fixed SBP inference of
narrow
op and added global unit test fornarrow
andchunk
(#7750) -
Improved legal SBP list of
batch_matmul
(#8385) -
Fixed NLLLoss’ failure to support model parallelism (#8380)
-
Fixed S->S and S->P inference in Slice Op SBP infer (#8521)
Tensor
-
Fixed the bug occurs when Tensor dim is set to -1
-
Fixed failure for Tensor type to be directly transferred to int and float in Python (#7927)
-
Fixed the bug in
Tensor.is_contiguous
that skips initialization when caching and executes random initialization when getting values (#7785) -
Fixed the bug in Tensor slice view under 1d contiguous (#7898)
-
Fixed incorrect processing of None value by
Tensor.__eq__
(#7938) -
Fixed unaligned memory size in
from_numpy
interface (#7963) -
Fixed incorrect initialization of random seed in Tensor (#7904)
-
Fixed failure of
oneflow.Size
to create Tensor with a specified shape (#8429) -
Aligned
alpha
parameter inTensor.add
(#8140)
Scalar Tensor
-
Fixed failure of
add
to support Scalar Tensor (#7827) -
Fixed failure of
reduce_sum
to support Scalar Tensor (#7866) -
Fixed failure of
one_hot
to support Scalar Tensor (#7975)
Fixed failure of gather
to support Scalar Tensor (#8376)
-
Fixed “memory access out of bounds” error in
dim_scatter
kernel under Scalar Tensor (#8418) -
Fixed failure of start and end parameters in
arange
op to support Scalar Tensor (#8522) -
Fixed failure of
all
to support Scalar Tensor and 0-Size Tensor (#8547)
0-Size Tensor
-
Fixed failure of
conv
anddeconv
to support 0-Size Tensor (#8001) -
Fixed failure of
cuda_check_numerics
to support 0-Size Tensor (#8050) -
Fixed failure of
expand
andadvanced_index
to support 0-Size Tensor (#8094) -
Fixed the bug occurs when processing 0-Size Tensor in
repeat_interleave
kernel and removed relevant special judge ingather
(#8414) -
Fixed failure of
diag
to support 0-Size Tensor (#8557)
Operators
-
Fixed sorting in
nms
unit test (#7831) -
Fixed torch alignment of beta and threshold interfaces of
softplus
op (#7888) -
Fixed failure of
expand
to support passing tuples as parameters (#7913) -
Fixed computation failure in
randperm
when n is too large (#7908) -
Fixed failure relative to list or tuple in parameter passing in
meshgrid
(#7933) -
Fixed
nn.functional.conv2d
bug that all parameters must be specified (#7892) -
Fixed failure of
rand
andrandn
to support tuple as an input (#7914) -
Fixed the bug occurs in
concat
when inputs are of inconsistent data types (#7921) -
Fixed wrong device id got by generator in certain cases in
randn
,dropout
,randint
,rand
,random_mask_like
, andrandperm
(#7896) -
Fixed inconsistent behaviors of
__shfl_sync
undersm_61
inlayernorm
(#7978) -
Fixed failure of
scatter
op to support negative dim (#7934) -
Fixed the bug in
scatter
op nd update value(#7953) -
Fixed failure of
masked_select
to support certain Broadcast operations in eager mode (#7984) -
Fixed the bug in
PReLU
op when dispatching num_blocks (#8004) -
Fixed misused numpy forced synchronization logic in
index_select
python and transferred the logic to functor for implementation (#7965) -
Aligned dtype parameter in
prod
(#7932) -
Fixed the bug occurs when
ord = 0
inlinalg.vector_norm
op; Fixed check on nan/inf by clip_grad (#8007) -
Fixed failure of
min
andmax
to operate on inconsistent dtypes (#8021) -
Added
num_batches_tracked
buffer tobatch_norm
to facilitate transfer of ResNet-18, a torch pretrained model, to OneFlow (#7920) -
Fixed the misuse of
logf
,expf
, andpowf
in math kernel (#8038) -
Fixed exclusion of dtype parameters in
cumsum
andcumprod
and providedTensor.cumsum
andTensor.cumprod
methods (#8065) -
Fixed possible overflow when dtype is not int64 in
non_zero
op (#7907) -
Aligned
sum
,mean
,all
,any
, andprod
operations inreduce
(#8085) -
Fixed incorrect backward computation in
cumprod
(#8136) -
Aligned
alpha
parameter insub
operation (#8026) -
Fixed shape inference in
upsample
op (#8105) -
Fixed failure of
addn
inplace operation on CPU tensor (#8280) -
Fixed limit on tensor size in
cum
backward op based on the size of shared memory (#8289) -
Improved the logic of dtype inference for
arange
op (#8338) -
Fixed NaN propagation of UnaryFunctor (#8346)
-
Fixed ndim check of
pad
(#8354) -
Fixed vector check in
broadcast_min
andbroadcast_max
backward computations (#8379) -
Fixed the bug relative to index computation logic in
cumprod
op (#8388) -
Fixed possible int32 overflow in
softmax
and math unary / binary cuda kernel; for kernels that operate integer division oni
inCUDA_1D_KERNEL_LOOP
, providedif
statement to branch computations to prevent performance loss in most cases when int32 works (#8472) -
Fixed failure to pass size via
size=(...)
in random ops (normal
,rand
,randn
,randint
, andrandperm
) (#8506)
Device
-
Fixed error in
cudaGetDeviceCount
when CUDA device count=0 (#8184) -
Fixed possible unregistration of devices caused by
hob.ToString
method; Used static local variables to establish dependency between static variables of device registration and the static code for device registration (#8235) -
Fixed
cudaErrorNoDevice
caused by drive errors (#8262) -
Fixed memory leak caused by realpath (#8540)
Higher order derivative
-
Introduced AutogradCapturedTensor in backward computation to avoid circular reference and allow correct backtracking to the input gradient node in higher order derivative graph (#7808)
-
Added higher order derivative of
sin/cos
op; Fixedautograd
bugs relative to higher order derivative (#8163) -
Fixed bugs in backward computation in
concat
andsplit_like
to support higher order derivative (#8208)
Build
-
Fixed RTD [sphinx] failure to build docstr (#7901)
-
Fixed compilation failure caused by opencv copy header failure (#7944)
-
Fixed failure to generate a new
.so
in compilation whenCMAKE_LINK_DEPENDS_NO_SHARED=YES
(#7868) -
Fixed Eigen url in cmake third party (#8223)
-
Fixed the bug caused by multi-time linking to libof_protoobj in XRT (#8326)
-
Made libproto a dynamic library to avoid collision between static global variables (#8345)
-
Made
of_pyext_obj
static only when there is one Python extension dynamic library that has Python symbols (#8393) -
Fixed the bug in
undefined symbol: del_curterm
in source code compilation (#8398) -
Fixed false positive warning in gcc11 compilation (#8401)
-
Fixed SegFault that occurs when unzipping dataset in the container by making zlib a dynamic library (#8481)
-
Fixed undefined reference of culibosTlsSetValue (#8479)
-
Fixed stringop-truncation compilation error for gcc9 (#8532)
CI
-
Disabled static link of Simple CI and enabled debug build to avoid too many symbols (#7940)
-
Fixed the bug in AutoTest fake program; Fixed print error in AutoTest (#8279; #8290)
Module
-
Disabled conv3d test temporarily for its relatively large error of random values (#7969)
-
Reduced test error in nn.LayerNorm (#7941)
-
Optimized input data range of certain math op tests (#8010)
-
Fixed incorrect unit test case in
permute
(#8083) -
Aligned error message of chunk to torch (#8096)
-
Fixed incorrect use of
permute
in tensor tests (#8144) -
Fixed omission of test cases in
instancenorm
(#8215) -
Adjusted unit test threshold for
leaky_relu
(#8242) -
Annotated cpu bn grad method that tests with random values (#8257)
-
Skipped test cases of
global argmax
andmedian
in multi-GPU scenarios (#8264) -
Adjusted unit test threshold for
fused_dot_feature_interaction
(#8293) -
Disabled unit tests for
conv_transpose1d
,conv_transpose2d
, andconv_transpose3d
(#8319) -
Adjusted tolerance setting in embedding_renorm unit test (#8394)
-
Removed test cases with excessive accumulated elements in
test_fused_dot_feature_interaction_pooling_sum
to avoid overly large sum error (#8425)
Documentation
-
Ensured that all PyTorch references in OneFlow API documentation belong to the same PyTorch version (1.10.0) (#8058)
-
Added "copy" button for code in API docs to facilitate trial runs of sample code (#7997)
-
Refined script that automatically generates version status for OneFlow APIs and fixed bugs in docs (#8546)
-
Refined interface documentation of Tensor and Module (#7823)
-
Refined
Tensor.to_global
interface documentation and added descriptions of grad_sbp
-
Refined
Tensor.to_local
interface documentation -
Added Tensor Attributes docs for
oneflow.placement
,oneflow.env.all_device_placement
, andoneflow.sbp.sbp
-
Added interface documentation for
Module.to_consistent
(outdated) andModule.to_global
-
-
Fixed invalid links in Tensor docs and updated
consistent
toglobal
(#7821) -
Added docstr for
Tensor.sqrt
,Tensor.square
,Tensor.addmm
,Tensor.cosh
,Tensor.diagonal
,Tensor.log
,Tensor.ndim
, andTensor.rsqrt
(#7841) -
Enabled derived classes of pybind11 to add documentation for non-overriding methods and added interface documentation related to Tensor and autograd (#7849)
-
Refined documentation of
oneflow.argsort
(#7844) -
Refined documentation of
Tensor.zero_
,Tensor.is_contiguous
,Tensor.is_cuda
, andoneflow.nn.functional.layer_norm
op (#7839) -
Refined interface documentation of
support_sparse
andstep
inoneflow.optim.Adamw
,oneflow.optim.SGD
(#7848) -
Refined interface documentation of
LambdaLR.step
,ReduceLROnPlateau.in_cooldown
, andReduceLROnPlateau.is_better
(#7848) -
Refined interface documentation of
nn.Module
(#8190) -
Refined interface documentation of
oneflow.optim.lr_scheduler.PolynomialLR
(#8430) -
Refined docs and formula illustrations for
oneflow.nn.CombinedMarginLoss
(#8206) -
Refined documentation of
oneflow.logical_and
,oneflow.logical_or
,oneflow.logical_xor
, andoneflow.logical_not
(#8297) -
Fixed the bug in the documentation of quantization ops (#8333)
-
Updated solution in Troubleshooting for the case when
libunwind.h
is not found (#8336) -
Restructured API documentation based on features; added and refined docs of features that are unique to OneFlow (#8392)