14 Jun 23:42

87d4ab1

v0.4.0

Changelog v0.4.0

Highlights

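This release adds a large number of new features to OneFlow; 0.4.0 is the largest update since OneFlow was open-sourced. It adds 2-D SBP, pipeline parallelism, a new Checkpointing interface, and many interfaces aligned with PyTorch, and it also supports CUDA 11.2. We previously open-sourced OneFlow's GPT source code, which makes heavy use of the new features in this release. You are also welcome to read the article "OneFlow: Enabling Every Algorithm Engineer to Train GPT".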

Lazy mode feature updates

2-D SBP support

Convert to 2-D

with flow.scope.placement("gpu", "0:0-3", (2, 2)):
    x = flow.hierarchical_parallel_cast(
        x, parallel_distribution=["B", "S(1)"]
    )

Convert to 1-D

with flow.scope.placement("gpu", "0:0-3", (4,)):
    x = flow.hierarchical_parallel_cast(
        x, parallel_distribution=["S(0)"]
    )
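
To make the two casts above more concrete, the following pure-Python sketch simulates which part of a 4x4 tensor each device would hold. It assumes the usual SBP reading, where "B" means broadcast (every device along that hierarchy dimension keeps a full copy), "S(n)" means split along axis n, and the i-th entry of parallel_distribution applies to the i-th dimension of the device hierarchy. The helpers split_axis and local_shard are hypothetical names for illustration only; they are not part of the OneFlow API.

```python
def split_axis(rows, axis, parts, index):
    """Return chunk `index` out of `parts` equal chunks of a 2-D list along `axis`."""
    length = len(rows) if axis == 0 else len(rows[0])
    assert length % parts == 0, "this sketch only handles even splits"
    step = length // parts
    lo, hi = index * step, (index + 1) * step
    if axis == 0:
        return [list(row) for row in rows[lo:hi]]
    return [list(row[lo:hi]) for row in rows]


def local_shard(x, hierarchy, parallel_distribution, coord):
    """Simulate the local piece of `x` held by the device at `coord` (illustrative only)."""
    shard = x
    for size, sbp, index in zip(hierarchy, parallel_distribution, coord):
        if sbp == "B":
            continue  # broadcast: nothing is cut along this hierarchy dimension
        if sbp.startswith("S(") and sbp.endswith(")"):
            shard = split_axis(shard, int(sbp[2:-1]), size, index)
        else:
            raise ValueError(f"unsupported SBP label in this sketch: {sbp}")
    return shard


# A 4x4 tensor whose entries are 0..15 in row-major order.
x = [[r * 4 + c for c in range(4)] for r in range(4)]

# 2-D case: 4 devices arranged as (2, 2), distribution ["B", "S(1)"].
shards_2d = {
    (i, j): local_shard(x, (2, 2), ["B", "S(1)"], (i, j))
    for i in range(2)
    for j in range(2)
}

# 1-D case: 4 devices arranged as (4,), distribution ["S(0)"].
shards_1d = [local_shard(x, (4,), ["S(0)"], (rank,)) for rank in range(4)]

print(shards_2d[(0, 1)])  # columns 2 and 3 of every row
print(shards_1d[2])       # row 2 only
```

In the 2-D case, devices (0, j) and (1, j) hold identical column blocks because the first hierarchy dimension is broadcast; in the 1-D case each of the four devices holds one row.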

New interface for pipeline parallelism

Create a scope for a pipeline_stage

with flow.experimental.scope.config(
        pipeline_stage_id_hint=dist_util.get_layer_stage(layer_idx)
    ):
    ...

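For pipeline parallelism to work well, gradient accumulation must be used; it also makes it possible to run a larger batch within limited memory. Set the number of gradient accumulation steps through the config: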

func_cfg = flow.FunctionConfig()
...
func_cfg.train.num_gradient_accumulation_steps(args.num_accumulation_steps)
@flow.global_function(..., function_config=func_cfg)
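
Why gradient accumulation lets a larger batch fit in limited memory can be sketched without any framework. The example below is a minimal illustration on a one-parameter linear model: averaging the gradients of equal-sized micro-batches reproduces the gradient of the full batch, while only one micro-batch is processed at a time. The functions grad_mse and accumulated_grad are hypothetical helpers, and the assumption that micro-batch gradients are averaged (rather than summed) is ours, not something stated in these release notes.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, num_accumulation_steps):
    """Average the gradients of equal-sized micro-batches (illustrative only)."""
    assert len(xs) % num_accumulation_steps == 0, "sketch assumes equal micro-batches"
    micro = len(xs) // num_accumulation_steps
    total = 0.0
    for step in range(num_accumulation_steps):
        lo, hi = step * micro, (step + 1) * micro
        # Only one micro-batch is "in memory" at a time.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi])
    return total / num_accumulation_steps


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = grad_mse(w, xs, ys)                  # one batch of 8 samples
acc = accumulated_grad(w, xs, ys, 4)        # 4 micro-batches of 2 samples
print(math.isclose(full, acc, rel_tol=1e-12))
```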

ZeRO optimization support

How to enable:

func_cfg = flow.FunctionConfig()
...
func_cfg.optimizer_placement_optimization_mode(mode) # mode  = "non_distributed" or "distributed_split"
@flow.global_function(..., function_config=func_cfg)

For sample code, see this test case
mode = "distributed_split" corresponds to stage 2 of DeepSpeed's ZeRO optimization
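
The idea behind sharding optimizer work across devices can be sketched in plain Python. In ZeRO stage 2, optimizer states and gradients are partitioned across ranks, so each rank updates only its own slice of the parameters. The simulation below is a conceptual sketch under that assumption and says nothing about how OneFlow implements "distributed_split" internally; shard_bounds, sgd_momentum_step, and sharded_step are hypothetical helper names, and the list concatenation merely stands in for an all-gather.

```python
def sgd_momentum_step(params, grads, momentum, lr=0.1, beta=0.9):
    """In-place SGD with momentum on plain Python lists."""
    for i in range(len(params)):
        momentum[i] = beta * momentum[i] + grads[i]
        params[i] -= lr * momentum[i]


def shard_bounds(n, world_size, rank):
    """Half-open index range of the slice owned by `rank`."""
    per_rank = (n + world_size - 1) // world_size
    lo = min(rank * per_rank, n)
    return lo, min(lo + per_rank, n)


def sharded_step(params, grads, shard_state, world_size):
    """Each rank updates only its slice; concatenation stands in for an all-gather."""
    gathered = []
    for rank in range(world_size):
        lo, hi = shard_bounds(len(params), world_size, rank)
        local = params[lo:hi]
        sgd_momentum_step(local, grads[lo:hi], shard_state[rank])
        gathered.extend(local)
    return gathered


n, world_size = 10, 4
reference = [float(i + 1) for i in range(n)]
reference_momentum = [0.0] * n            # one rank holds all optimizer state

sharded = list(reference)
shard_state = []                          # each rank holds state for its slice only
for rank in range(world_size):
    lo, hi = shard_bounds(n, world_size, rank)
    shard_state.append([0.0] * (hi - lo))

for _ in range(3):
    # Toy loss sum(p ** 2), so the gradient of each parameter is 2 * p.
    sgd_momentum_step(reference, [2.0 * p for p in reference], reference_momentum)
    sharded = sharded_step(sharded, [2.0 * p for p in sharded], shard_state, world_size)

print(sharded == reference)
print(max(len(state) for state in shard_state), "of", n, "momentum values per rank")
```

The sharded run produces the same parameters as the unsharded reference, while each simulated rank stores at most 3 of the 10 momentum values.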

New interface for Checkpointing

with flow.experimental.scope.config(
    checkpointing=True
):
    ...
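
The trade-off behind checkpointing (fewer stored activations in exchange for recomputation during the backward pass) can be illustrated with a small framework-free sketch. It differentiates a chain of scalar tanh layers twice: once storing every layer input, and once storing only the input of each segment and recomputing the rest during the backward pass. All function names here are hypothetical, and the sketch does not describe how OneFlow implements the feature internally.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad_x(w, x):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x) ** 2)
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_store_all(weights, x):
    """Backprop through the chain while keeping every layer input."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    g = 1.0
    for w, inp in zip(reversed(weights), reversed(inputs)):
        g *= layer_grad_x(w, inp)
    return g, len(inputs)


def grad_checkpointed(weights, x, segment):
    """Keep only segment inputs; recompute inner inputs during the backward pass."""
    starts = list(range(0, len(weights), segment))
    checkpoints = []
    for start in starts:
        checkpoints.append(x)
        for w in weights[start:start + segment]:
            x = layer(w, x)
    g = 1.0
    peak = len(checkpoints)
    for start, h in zip(reversed(starts), reversed(checkpoints)):
        seg = weights[start:start + segment]
        inputs = []
        for w in seg:  # recompute this segment's forward pass
            inputs.append(h)
            h = layer(w, h)
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, inp in zip(reversed(seg), reversed(inputs)):
            g *= layer_grad_x(w, inp)
    return g, peak


weights = [0.5 + 0.05 * i for i in range(16)]
g_full, stored_full = grad_store_all(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, segment=4)

print(math.isclose(g_full, g_ckpt))
print(stored_full, "stored values without checkpointing,", stored_ckpt, "with")
```

With 16 layers and segments of 4, the checkpointed version holds at most 8 values at a time (4 checkpoints plus 4 recomputed inputs) instead of 16, at the cost of one extra forward pass per segment.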

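Eager mode feature updates

Provides the `oneflow.experimental` namespace, partially aligned with the `torch.xxx` interfaces

How to use the new interfaces

import oneflow.experimental as flow
flow.enable_eager_execution() # enable eager

Features partially aligned so far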

flow.nn.Conv2d  <->  torch.nn.Conv2d
flow.nn.BatchNorm2d  <->  torch.nn.BatchNorm2d
flow.nn.ReLU  <->  torch.nn.ReLU
flow.nn.MaxPool2d  <->  torch.nn.MaxPool2d
flow.nn.AvgPool2d  <->  torch.nn.AvgPool2d
flow.nn.Linear  <->  torch.nn.Linear
flow.nn.CrossEntropyLoss  <->  torch.nn.CrossEntropyLoss
flow.nn.Sequential  <->  torch.nn.Sequential

flow.nn.Module.to  <->  torch.nn.Module.to
flow.nn.Module.state_dict  <->  torch.nn.Module.state_dict
flow.nn.Module.load_state_dict  <->  torch.nn.Module.load_state_dict

flow.save  <->  torch.save
flow.load  <->  torch.load

flow.Tensor  <->  torch.Tensor
flow.tensor  <->  torch.tensor
flow.tensor.to  <->  torch.tensor.to
flow.tensor.numpy  <->  torch.tensor.numpy
flow.tensor add/subtract/multiply/divide  <->  torch.tensor add/subtract/multiply/divide
flow.tensor.flatten  <->  torch.tensor.flatten
flow.tensor.softmax  <->  torch.tensor.softmax

flow.optim.SGD  <->  torch.optim.SGD

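With the modules above you can already build common networks such as ResNet, BERT, and MobileNetV3 with little effort. Later releases will align with and support more interfaces, at which point most networks built on PyTorch can be switched to OneFlow easily.

Quick-start example (LeNet): https://github.com/Oneflow-Inc/models/blob/main/quick_start_demo_lenet/lenet.py
Documentation for the new interfaces: https://oneflow.readthedocs.io/en/master/experimental.html
ResNet50 example aligned with torchvision: https://github.com/Oneflow-Inc/models/tree/main/resnet50
The next few releases will add more PyTorch-aligned interfaces
The aligned interfaces under experimental will be moved into the oneflow namespace in the 0.6.0 release, at which point they will be fully aligned with PyTorch; OneFlow 0.6.0 will make eager the default execution mode
Eager mode currently runs on a single GPU only; multi-GPU support will arrive in 0.5.0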

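Other updates

New Python pip package naming and versioning rules

Previous OneFlow releases followed a "different package names, same version name" rule, for example oneflow_cu102==0.3.4. Starting with 0.4.0 the rule is "same package name, different version names", for example oneflow==0.4.0+cu102. For the latest installation instructions, see the "Install with Pip Package" section of the README.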
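
The difference between the two naming schemes can be shown with a few lines of string handling. In the new scheme the platform tag travels in the part of the version after "+", which PEP 440 calls a local version label. The function parse_requirement below is a hypothetical helper for illustration, not part of pip or OneFlow.

```python
def parse_requirement(spec):
    """Split 'name==version[+local]' into its parts (illustrative only)."""
    name, _, version = spec.partition("==")
    public, _, local = version.partition("+")
    return {"name": name, "version": public, "platform": local or None}


old = parse_requirement("oneflow_cu102==0.3.4")   # platform encoded in the package name
new = parse_requirement("oneflow==0.4.0+cu102")   # platform encoded in the version

print(old)
print(new)
```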

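CUDA 11.2 support

Both the stable and nightly builds of OneFlow now support the CUDA 11.2 platform (cu112)

ONNX module moved to its own repository

The ONNX module is now maintained in a new repository, https://github.com/Oneflow-Inc/oneflow_convert_tools. The ONNX-related code in the main OneFlow repository will be removed in the next release; for details, see the article "How Does the Deep Learning Framework OneFlow Interact with ONNX?". oneflow_convert_tools is currently developed for OneFlow's lazy mode, and its latest version is v0.3.2. Versions of oneflow_convert_tools that target eager mode will start from 0.4.0.

"Coming up next"

The next OneFlow release will include more comprehensive PyTorch compatibility, including a richer set of supported interfaces and multi-GPU support. It will also support conversion between dynamic and static graphs. Stay tuned!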


15 Jan 07:57

jackalcooper

v0.3.4

50b83a5

Hotfix v0.3.4

bump version 0.3b5

Former-commit-id: 10e43bf85c32b791818a53bc871ceddc02a48a78


13 Jan 15:09

jackalcooper

v0.3.3

1920e02

v0.3.3

Op fixes and performance optimizations

[enhancement][op] reduce sum half kernel #4110
[enhancement][op] simplify cosface #4107
[enhancement][op] indexed_slices update support weight_decay #4096
[enhancement][op][python] Migrate swish and mish namespace from math to nn #4104
[enhancement][op] Add elementwise maximum/minimum ops #4069
[enhancement][op] Fix Code format warning in hardswish #4105
[enhancement][feature][op] Add Scalar Pow #4082
[bug][op] Fix bug: mut_shape_view of static output maybe null in UserKernel::ForwardShape #4094
[enhancement][op][refactor] Migrate cast_to_static_shape to user op #4095
[feature][op] Add GroupNorm op #4089
[feature][op] Distributed partial sampler #3857
[enhancement][op][python] add Relu6 activation #4029
[bug][op] Rename ont_hot_op.cpp to one_hot_op.cpp #4093
[bug][op][python] fix hardtanh CI precision error #4091
[enhancement][op] add remove_img_without_anno api for COCOReader #4088
[enhancement][op] Add Hardtanh activation #4049
[enhancement][op] Add ELU activation #4065
[enhancement][op][python] Update logsoftmax.py #4041
[documentation][op] Fix in_top_k api document #4079
[enhancement][op] Add Hardswish activation #4059
[enhancement][op][python] Add hard sigmoid #4043
[enhancement][op] Dev in top k #3611
[bug][op] Fix argwhere tmp buffer infer #4061
[enhancement][op] Optimize softmax cuda kernel #4058
[feature][op] Add InstanceNorm 1d & 3d implementation #4052
[feature][op] Quantization aware training releated ops #3764
[enhancement][op] Generic unfold kernel implementation #4033
[enhancement][op] User op dim_gather support dynamic input and index #4039
[enhancement][op] Reflection pad2d op #3777
[enhancement][op] slice support empty blob #4025
[bug][enhancement][op] Migrate argwhere to user op #4021
[bug][op] Dev rm old tanh #4035
[enhancement][op][refactor] Make MaxWithLogThreshold and SafeLog header only #4030
[op][purge] Tidy up op_conf.proto #3932
[enhancement][op][python] Dev bcewithlogits loss #4024
[feature][op] Add implementation of InstanceNorm2D op #4020
[enhancement][op][refactor] Refactor gpu_atomic_add #4027
[enhancement][op][python] add kldivloss #4012
[enhancement][op][python] Dev oneflow ones #3990
[enhancement][op] Add flatten/squeeze/expand_dims to auto mixed precision clear list and use reshape instead of reshape_like to do reshape grad computation #4015
[enhancement][op][python] add pixel shuffle #4003
[enhancement][op] Scalar kernels use element-wise template #4013
[enhancement][op][python] add zeros api #3991
[enhancement][op] Optimize ComputeEntropyGpu with CUB #3930
[feature][op] CUDA template for element-wise kernels #4007

System components

[enhancement][system] migrate job_build_and_infer api to pybind11 #3940
[feature][system] quantization aware training pass #3817
[eager][enhancement][system] Mig op arg para attr #4102
[feature][system] Tensor Float 32 Support. #4072
[enhancement][system] Mig op arg para attr #4090
[enhancement][system] Mig py cfg sbp #4086
[enhancement][system] Refactor python remote blob #4081
[enhancement][system] remove BlobDef #4071
[bug][system] Fix warning: moving a local object in a return statement prevents copy elision #4067
[enhancement][system] Refactor python blob desc #4063
[feature][system] Add nvtx range and thread naming #4064
[documentation][enhancement][system] Add docs on installing legacy versions of oneflow #4056
[bug][system] support eager empty blob #4047
[enhancement][system] Add err info for ncclGroupEnd check #4048
[enhancement][system] Optimize dynamic loss scale parameters #4045
[purge][system] Remove col_id #4046
[enhancement][system] Scope with symbol #4040
[enhancement][system] Job desc with symbol #4032
[enhancement][system] Parallel desc with symbol #4017
[bug][system] change sbp order value for layer norm #3995
[bug][system] Fix eager test_resume_training test #4023
[bug][system] Fix python cfg error bug #4018
[bug][system] Remove redundant pack_size in GenericLauncher #4014
[enhancement][system] Set default block size to 512 #4011
[feature][system] Remove swig in oneflow #3969
[feature][system] Migrate oneflow internal api to pybind11 #3953
[build][enhancement][system] Bump nccl from 2.7.3 to 2.8.3 #3875

Eager mode

[bug][eager] Fix eager bug of test split like #4004
[bug][eager] add float16 datatype for eager boxing #4092

Python frontend

[feature][python] add stack #3897
[bug][enhancement][python] Fix test kldivloss tolerance #4103
[bug][enhancement][python] Fix "hardsigmoid" eager test error #4085
[bug][documentation][python] Add hardsigmoid #4076
[api][enhancement][python] add deprecate api optimizer.PolynomialSchduler #4038

Toolchain

[feature][tooling] split_cfg_cpp_and_pybind_generator #4002
[enhancement][tooling] Cfg hash #4084
[enhancement][tooling] Finetune cfg tool #4050
[enhancement][tooling] optimi...


16 Dec 11:07

jackalcooper

v0.3.2

119ce9b

v0.3.2

Changelog

v0.3.2 (16/12/2020)

[enhancement][system] Migrate foreigns to pybind11 #3939
[feature][op][python] add swish activation #3970
[bug][op] fix argwhere format #4010
[enhancement][op] Argwhere support empty blob #4009
[feature][op][python] add mish activation #3972
[bug][eager] Fix eager memory leak and re-enable new checkpoint #4008
[ci][enhancement] upload bin to oss #4000
[enhancement][op] Fuse cast scale #3999
[enhancement][op] layer_norm_grad_add_to_output #3998
[enhancement][system] Optimize NcclCollectiveBoxingExecutorBackend::ExecuteGroup latency #3997
[feature][system] OptimizerPlacementOptimization #3944
[enhancement][op] Dev optimize prelu #3987
[api][enhancement][op] Switch identity to user op and add it to auto mixed precision clear list #3992
[enhancement][op] Optimize slice kernel #3989
[bug][op] Hotfix: add parallel cast to amp clear list #3988
[bottleneck][enhancement][system] Sublinear memory cost by checkpointing #3976
[enhancement][system] Add gradients stats aggregation #3979
[feature][system] nccl enable mixed fusion #3981
[enhancement][op] fused_scale_tril / hot fix matmul / softmax broadcast_sub broadcast_div #3980
[bug][op] add combined margin cpu and fix bug #3961
[feature][op] Add multi_square_sum op #3977
[bug][op] fix pad op #3971
[ci][enhancement][test] larger tol for bn #3965
[cfg][enhancement][python] Dev replace py job conf proto to cfg #3856
[enhancement][refactor][ssp] Dev ssp fix fuse and add just #3959
[cfg][enhancement][refactor][tooling] replace ScopeProto to cfg #3816
[feature][op] TripOp add fill value #3960
[enhancement][system] remove serialized in python callback #3891


02 Dec 08:59

jackalcooper

v0.3.1

97300fa

v0.3.1

Changelog

v0.3.1 (02/12/2020)

[bug][system] Fix CollectiveBoxingGenericTaskNode::ProduceAllRegstsAndBindEdges #3946
[bug][op] Fix constant init value #3947
[api][enhancement][refactor][tooling] Refine custom op build #3925
[feature][op] add combined margin loss #3819
[enhancement][tooling] default show cpp error stack frame #3948
[cfg][enhancement][tooling] Dev replace py parallel conf proto to cfg #3810
[feature][system] Add NaiveB2PSubTskGphBuilder #3942
[bug][system] disable new checkpoint by default temporarily #3943
[bug][system] Explicitly specify the SBP in NonDistributedOptimizerPass #3937
[bug][op] indexed_slices_model_update handle empty tensor #3933
[bug][ci] fix oss list file 100 limit #3935


09 Oct 06:57

jackalcooper

v0.2.0

f3d736e

v0.2.0

Changelog

v0.2.0 (09/10/2020)

Op fixes and performance optimizations

Support fusing binary add ops with their predecessor nodes

FuseAddToOutput #3524
Dropout support add_to_output #3569
Dev matmul add to output #3581

Kernel performance optimizations

Fused BatchNormAddRelu #3519
bn_add_relu use bit mask #3645
layer_norm param grad #3604
Fused layer norm #3591
BiasAdd Row Col Half2 #3636
MaskAndScaleHalf2 #3643
Optimize CudaAsyncMemoryCopier #3543
Avoid using local memory in CropMirrorNormalizeGpuKernel #3539
LayerNormGpuKernel use fused InstanceScaleCenter #3573

Implement model update ops as user ops, and support fusion for model update ops

Add model update user ops #3546
Migrate L1L2RegularizeGradientOp to UserOp Framework #3527
model update fuse scalar_mul_by_tensor #3635
Dev indexed slices model update user ops #3561
Dev adam xla and rm sys op #3584

NCCL supports setting the maximum number of fused ops

Add nccl_fusion_max_ops #3567

New ops

[feature] Fused ImageDecoderRandomCropResize #3644
Add AmpWhiteIdentityOp #3658
Add ImageDecoderRandomCropResizeOp::InferParallelSignature #3646
Dev add op tril #3511
add masked fill op #3515

Global cache support for cuDNN algorithm inference

Add CudnnConvAlgoCache #3649

Bug fixes and other changes

fix broadcast div grad #3525
fix optimizer copy-paste bug #3508
fix bug about pad value #3640
Optimize some default values #3648
Fix cuda runtime #3621
Fix reshape inplace #3545
Refactor rmsprop mean_square and add unit tests for optimizers #3523
Remove cuDNN fields from OperatorConf #3536
Add UserOpConfWrapperBuilder::ScopeSymbolId #3528
Fix NcclCollectiveBoxing builder_name #3563
rm conv2d cpu testcase #3574
fix broadcast_to_compatible_with grad bug #3609
Add inline for half #3600
Fix converter half #3599
Fix gpu_atomic_max double overload use fmaxf #3578
fix upsample #3579

Eager Execution

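Added more comments to the eager-related code; adjusted the stateless_call instruction to distinguish between the two parameter kinds mutable_input and output; implemented the broadcast instruction.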

fix fmt cuda_copy_d2h_stream_type #3606
add comments for cuda_copy_d2h_stream_type.cpp #3603
Fix TopoForEachNode in GenCollectiveBoxingPlan #3566
Split call_op_kernel instruction args into const_input/mutable_input/output #3562
split BlobObject and EagerBlobObject #3485
remove unused code under vm/ #3585
Dev broadcast instruction #3555
Broadcast instruction #3552

pybind11 integration

SWIG and pybind11 currently coexist in OneFlow; the codebase will gradually switch to pybind11

pybind11 integration #3517
upgrad to pybind11 master and pass exe path #3522
Update rel script for pybind11 #3526
Dev oneflow pybind api #3625

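Build tooling optimizations and fixes

Fixed some unreasonable configurations that caused builds to fail or run slowly, sped up dependency downloads, and fixed the Ubuntu Dockerfile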

[bug] fix ubuntu docker build #3504
change link order to fix the cpu+openblas build #3634
[bug] fix bug: oneflow cpu-only lib flags #3615
add convert_url_to_oss_https_url and DCN flag #3595
Add cn url in readme #3583
make absl use tar not git #3570
Optimize nvcc gencode flag #3577

Transport network transfer subsystem

Supports dynamic P2P network transfer

[feature] Transport #3549

CFG tool integration

CFG is a tool based on proto syntax that generates code for exchanging data between Python and C++

Dev integrate cfg #3597
Less usage of PbMessage in Operator #3651

XLA support improvements

Upgraded to the latest TF version (2.3.0)

upgrade XRT XLA to TF 2.3.0 #3531
Fix XLA crash #3548

GRPC upgrade

Upgraded to the latest GRPC version

Upgrade grpc #3551
[bug] [bugfix] GRPC: control server CompletionQueue shutdown. #3589

CI and test optimizations

Added XLA to CI, improved the op test cases, and automated uploading of the latest master commit

Parallel unit tests (Step 1, refactor existing unit tests) #3632
Add build type for pr oss upload #3627
XLA ci support #3564
Auto upload tar to aliyun oss #3592
Don't pack source code if it is not master #3593
move fmt to github hosted #3559
refactor ci #3557
CtrlTest find available port for ctrl port instead of handwriting #3610

ONNX support

Optimized the IR and updated the test scripts

onnx update #3495

Documentation additions and revisions

Add api docs zzk #3505
Add api docs zzk #3533
Add api docs zzk #3514
fix masked_fill op doc #3560

Python frontend fixes

Fix the bug of using op_module_builder in namespace scope #3513
Comment release global for now to avoid random crash in python #3629
update lib name in link flags #3623
rm spaces in rm_spaces optimizer.py #3619

Optimizations and fixes for common system components

[enhancement] flat ErrorProto error_type #3474
[enhancement] Added user_op_conf getter for BatchAxisContext/KernelInitContext/SbpContext #3506
[bug] Fix UserOpConfWrapper::has_input/has_output #3507
support reflecting cfg message #3655
Refactor scope #3652
Refactor placement scope #3650
Bugfix split config proto and session job set #3637
[Bug fix] Release global variables #3624
Add OpRegistry::SetAreaId #3608
Dev converter #3580
Tensor::dptr support half #3582
Use InferOutBlobDescsIf instead of InferBlobDescsIf in InferOpNodeLogicalBlobDesc #3535
Add ctrl_in_op_name only when unreachable #3537


13 Sep 10:34

jackalcooper

grpc0

6c6dfcc

grpc0: change openssl, cares install path

Pre-release

Former-commit-id: 86c49d1163d3b822ca076201d8753d15cf895969


10 Sep 11:28

jackalcooper

v0.2b1

99845bd

version 0.2b1

Pre-release

Former-commit-id: 349289677b3fcd6fd51dd488dda8bf270aa563bf


08 Sep 04:08

jackalcooper

v0.2b0

b72bcec

0.2 beta 0

Pre-release

fix version

Former-commit-id: 65069051da5b822fcc8510c0aaa0c8189e00b61d


07 Sep 14:28

jackalcooper

v0.1.11b1

73603a0

0.1.11 beta1

Pre-release

upgrade XRT XLA to TF 2.3.0 (#3531)

* compile tf 2.3.0 with gcc 7.3

* fix oneflow eigen

* minor fix

* fix include

* update protobuf if xla is on

* update path of tf proto generated cpp files

* fix path in script

* add .clangd to git ignore

* update xla ifs

* update scripts

* update path in script for clangd

* add gitignore

* add cmake flag XRT_TF_URL

* rm comment

* check in changes

* bash tricks to enable gcc 7.3

* use arg to control tuna

* bumpversion

* fix build wheel

* use real path

* add dir for cpu

* fix unwanted yum update cublas

* uncomment all

* rm suffix of wheelhouse_dir

* add log info

Co-authored-by: tsai <caishenghang@1f-dev.kbaeegfb1x0ubnoznzequyxzve.bx.internal.cloudapp.net>
Co-authored-by: tsai <[email protected]>
Former-commit-id: da12e8db4f52d3c5351f0e43f3677dd948d3801d


Releases: Oneflow-Inc/oneflow
