Releases: Oneflow-Inc/oneflow

v0.4.0

14 Jun 23:42

Changelog v0.4.0

Highlights

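This release adds a large number of features to OneFlow; 0.4.0 is the largest update since OneFlow was open-sourced. It adds 2-D SBP, pipeline parallelism, a new Checkpointing interface, and many interfaces aligned with PyTorch, and it also supports CUDA 11.2. We previously open-sourced OneFlow's GPT source code, which makes heavy use of the new features in this release. You are also welcome to read the article "OneFlow: Enabling Every Algorithm Engineer to Train GPT".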

Lazy mode updates

2-D SBP support

  • Convert to 2-D
    with flow.scope.placement("gpu", "0:0-3", (2, 2)):
        x = flow.hierarchical_parallel_cast(
            x, parallel_distribution=["B", "S(1)"]
        )
  • Convert to 1-D
    with flow.scope.placement("gpu", "0:0-3", (4,)):
        x = flow.hierarchical_parallel_cast(
            x, parallel_distribution=["S(0)"]
        )
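
As a rough illustration of what the two placements above mean for the data each device holds, the sketch below uses plain Python lists instead of OneFlow tensors. The helpers `split` and `distribute` are hypothetical names and not part of OneFlow. The sketch assumes that "B" means every device along that hierarchy dimension keeps a full copy and that "S(k)" means the tensor is split evenly along axis k.

```python
# Illustrative only: plain Python lists stand in for tensors and devices.

def split(matrix, axis, parts):
    """Split a 2-D list into `parts` equal chunks along `axis` (0 = rows, 1 = columns)."""
    if axis == 0:
        step = len(matrix) // parts
        return [matrix[i * step:(i + 1) * step] for i in range(parts)]
    step = len(matrix[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in matrix] for i in range(parts)]


def distribute(matrix, hierarchy, parallel_distribution):
    """Return {device coordinate: local shard} for the given device hierarchy."""
    shards = {(): matrix}
    for size, sbp in zip(hierarchy, parallel_distribution):
        next_shards = {}
        for coord, local in shards.items():
            if sbp == "B":
                # Broadcast: every device along this dimension keeps a full copy.
                pieces = [local] * size
            else:
                # "S(k)": split evenly along axis k.
                pieces = split(local, int(sbp[2:-1]), size)
            for i, piece in enumerate(pieces):
                next_shards[coord + (i,)] = piece
        shards = next_shards
    return shards


x = [[r * 4 + c for c in range(4)] for r in range(4)]  # a 4x4 "tensor"

two_d = distribute(x, (2, 2), ["B", "S(1)"])  # 2x2 hierarchy
one_d = distribute(x, (4,), ["S(0)"])         # flat hierarchy of 4 devices
```

With the (2, 2) hierarchy, devices (0, 0) and (1, 0) hold the same left half of the columns, because the first dimension broadcasts and the second splits along axis 1. With the (4,) hierarchy, each device holds one row.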

New interface for pipeline parallelism

  • Create a scope for a pipeline_stage
    with flow.experimental.scope.config(
        pipeline_stage_id_hint=dist_util.get_layer_stage(layer_idx)
    ):
        ...
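  • For pipeline parallelism to work well, gradient accumulation must be used; it also makes it possible to run a larger batch within limited memory. Set the number of gradient accumulation steps through the config:
    func_cfg = flow.FunctionConfig()
    ...
    func_cfg.train.num_gradient_accumulation_steps(args.num_accumulation_steps)
    @flow.global_function(..., function_config=func_cfg)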
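
To show why gradient accumulation lets a large batch fit in limited memory, here is a minimal framework-free sketch. It is not OneFlow code: the scalar model, the data, and the `grad_mse` helper are all made up for illustration. It assumes equal-sized micro-batches and a mean-reduced loss, in which case averaging the micro-batch gradients reproduces the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

num_accumulation_steps = 4
micro_batch = len(xs) // num_accumulation_steps

# Process one micro-batch at a time and accumulate the averaged gradient.
accumulated = 0.0
for step in range(num_accumulation_steps):
    lo, hi = step * micro_batch, (step + 1) * micro_batch
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / num_accumulation_steps

# Reference: gradient computed over the whole batch at once.
full_batch = grad_mse(w, xs, ys)

# A single optimizer step is applied after all micro-batches have been seen.
learning_rate = 0.01
w_next = w - learning_rate * accumulated
```

Only one micro-batch needs to be resident at a time, yet the weight update matches the one computed from the whole batch.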

ZeRO optimization support

  • How to enable:
    func_cfg = flow.FunctionConfig()
    ...
    func_cfg.optimizer_placement_optimization_mode(mode)  # mode = "non_distributed" or "distributed_split"
    @flow.global_function(..., function_config=func_cfg)
  • For example code, see this test case
  • mode = "distributed_split" corresponds to stage 2 of DeepSpeed ZeRO optimization
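
For readers unfamiliar with ZeRO, the following sketch shows the general idea behind stage 2: gradients and optimizer state are partitioned so that each rank only stores and updates its own shard. It is a single-process simulation in plain Python, not OneFlow's implementation. The function names, the momentum-SGD optimizer, and the contiguous sharding scheme are assumptions chosen for illustration.

```python
def shard_bounds(num_params, world_size, rank):
    """Half-open index range [start, end) of the parameters owned by `rank`."""
    per_rank = (num_params + world_size - 1) // world_size
    start = min(rank * per_rank, num_params)
    return start, min(start + per_rank, num_params)


def zero_step(params, per_rank_grads, momentum_shards, lr=0.1, beta=0.9):
    """One momentum-SGD step with gradients and optimizer state partitioned by rank."""
    world_size = len(per_rank_grads)
    new_params = []
    for rank in range(world_size):
        start, end = shard_bounds(len(params), world_size, rank)
        # Simulated reduce-scatter: this rank receives the averaged gradient
        # for its own shard only.
        grad_shard = [
            sum(g[i] for g in per_rank_grads) / world_size for i in range(start, end)
        ]
        # Optimizer state (momentum) exists only for the owned shard.
        momentum = momentum_shards[rank]
        for j, g in enumerate(grad_shard):
            momentum[j] = beta * momentum[j] + g
        # Simulated all-gather: updated shards are concatenated in rank order.
        new_params.extend(params[start + j] - lr * momentum[j] for j in range(end - start))
    return new_params


params = [1.0, 2.0, 3.0, 4.0, 5.0]
per_rank_grads = [[1.0] * 5, [3.0] * 5]      # local gradients on rank 0 and rank 1
momentum_shards = [[0.0] * 3, [0.0] * 2]     # 3 + 2 entries instead of 5 per rank
updated = zero_step(params, per_rank_grads, momentum_shards)
```

Across both ranks the momentum buffers hold five entries in total, whereas replicating the optimizer state would hold five entries on every rank.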

New interface for Checkpointing

with flow.experimental.scope.config(
    checkpointing=True
):
    ...

Related reading: "Sublinear Memory Optimization: The Implementation of Activation Checkpointing in OneFlow"
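
The idea behind activation checkpointing can be sketched without any framework: keep only a few activations from the forward pass and recompute the rest when the backward pass needs them. The toy scalar layers, the segment size, and both function names below are hypothetical; this is a sketch of the general technique, not of how OneFlow implements it.

```python
# Each layer is a pair: (forward function, derivative with respect to its input).
LAYERS = [
    (lambda x: 2.0 * x, lambda x: 2.0),
    (lambda x: x * x, lambda x: 2.0 * x),
    (lambda x: x + 1.0, lambda x: 1.0),
    (lambda x: x * x * x, lambda x: 3.0 * x * x),
]


def backward_full(x):
    """Baseline: keep every layer input from the forward pass, then apply the chain rule."""
    inputs = []
    for forward, _ in LAYERS:
        inputs.append(x)
        x = forward(x)
    grad = 1.0
    for (_, derivative), layer_input in zip(reversed(LAYERS), reversed(inputs)):
        grad *= derivative(layer_input)
    return grad, len(inputs)


def backward_checkpointed(x, segment=2):
    """Keep only the input of each segment; recompute the other inputs during backward."""
    checkpoints = {}
    for i, (forward, _) in enumerate(LAYERS):
        if i % segment == 0:
            checkpoints[i] = x
        x = forward(x)
    grad = 1.0
    for start in sorted(checkpoints, reverse=True):
        layers = LAYERS[start:start + segment]
        inputs, h = [], checkpoints[start]
        for forward, _ in layers:  # recompute this segment's layer inputs
            inputs.append(h)
            h = forward(h)
        for (_, derivative), layer_input in zip(reversed(layers), reversed(inputs)):
            grad *= derivative(layer_input)
    return grad, len(checkpoints)


grad_full, kept_full = backward_full(1.0)
grad_ckpt, kept_ckpt = backward_checkpointed(1.0)
```

Both functions return the same gradient, but the checkpointed version keeps two activations from the forward pass instead of four, at the cost of running each segment's forward computation a second time.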

Eager mode updates

Added the oneflow.experimental namespace, which partially aligns with the torch.xxx interfaces

  • How to use the new interfaces

    import oneflow.experimental as flow
    flow.enable_eager_execution()  # enable eager execution
  • Features that are currently partially aligned

    flow.nn.Conv2d  <->  torch.nn.Conv2d
    flow.nn.BatchNorm2d  <->  torch.nn.BatchNorm2d
    flow.nn.ReLU  <->  torch.nn.ReLU
    flow.nn.MaxPool2d  <->  torch.nn.MaxPool2d
    flow.nn.AvgPool2d  <->  torch.nn.AvgPool2d
    flow.nn.Linear  <->  torch.nn.Linear
    flow.nn.CrossEntropyLoss  <->  torch.nn.CrossEntropyLoss
    flow.nn.Sequential  <->  torch.nn.Sequential
    
    flow.nn.Module.to  <->  torch.nn.Module.to
    flow.nn.Module.state_dict  <->  torch.nn.Module.state_dict
    flow.nn.Module.load_state_dict  <->  torch.nn.Module.load_state_dict
    
    flow.save  <->  torch.save
    flow.load  <->  torch.load
    
    flow.Tensor  <->  torch.Tensor
    flow.tensor  <->  torch.tensor
    flow.tensor.to  <->  torch.tensor.to
    flow.tensor.numpy  <->  torch.tensor.numpy
    flow.tensor add/sub/mul/div  <->  torch.tensor add/sub/mul/div
    flow.tensor.flatten  <->  torch.tensor.flatten
    flow.tensor.softmax  <->  torch.tensor.softmax
    
    flow.optim.SGD  <->  torch.optim.SGD

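    With the modules above, common networks such as ResNet, BERT, and MobileNetV3 can already be built easily. Later releases will align and support more interfaces, at which point most networks built on PyTorch can be switched to OneFlow with little effort.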

  • Quick-start example (LeNet): https://github.com/Oneflow-Inc/models/blob/main/quick_start_demo_lenet/lenet.py

  • Documentation for the new interfaces: https://oneflow.readthedocs.io/en/master/experimental.html

  • ResNet50 example code aligned with torchvision: https://github.com/Oneflow-Inc/models/tree/main/resnet50

  • The next few releases will add more interfaces aligned with PyTorch

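  • The aligned interfaces under experimental will be moved into the oneflow namespace in the 0.6.0 release, at which point they will be fully aligned with PyTorch; OneFlow 0.6.0 will make eager the default execution mode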

  • Eager mode currently supports only single-GPU execution; multi-GPU execution will be supported in 0.5.0

Other updates

New Python pip package name and version numbering rules

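Previously, OneFlow releases followed a "different package names, same version name" rule, e.g. oneflow_cu102==0.3.4. Starting from 0.4.0, they follow a "same package name, different version names" rule, e.g. oneflow==0.4.0+cu102. For the latest installation instructions, see the "Install with Pip Package" section of the README.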
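
The difference between the two naming schemes is easy to see by splitting the pinned requirement strings. In the new scheme the platform tag after "+" is what PEP 440 calls a local version label. The helper below is a hypothetical illustration, not part of pip or OneFlow, and only handles the simple `name==version` form shown above.

```python
def parse_pinned_requirement(spec):
    """Split 'name==version+local' into (name, public version, local label or None)."""
    name, _, version = spec.partition("==")
    public, _, local = version.partition("+")
    return name, public, local or None


# Old scheme: the platform is encoded in the package name.
old_style = parse_pinned_requirement("oneflow_cu102==0.3.4")

# New scheme: one package name, platform encoded in the local version label.
new_style = parse_pinned_requirement("oneflow==0.4.0+cu102")
```

Under the new scheme, tooling can look up a single package name and pick the platform from the version string.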

CUDA 11.2 support

Both the stable and nightly builds of OneFlow now support the CUDA 11.2 platform (cu112).

ONNX module moved to a separate repository

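The ONNX module is now maintained in a new repository, https://github.com/Oneflow-Inc/oneflow_convert_tools. ONNX-related code in the main OneFlow repository will be removed in the next release. For details, see the article "How Does the Deep Learning Framework OneFlow Interact with ONNX?". oneflow_convert_tools is currently developed for OneFlow's lazy mode, and its latest version is v0.3.2; versions of oneflow_convert_tools that target eager mode will start from 0.4.0.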

"Coming up next"

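The next OneFlow release will include more complete PyTorch compatibility, including a richer set of supported interfaces and multi-GPU support. It will also support conversion between dynamic and static graphs. Stay tuned!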

Hotfix v0.3.4

15 Jan 07:57
bump version 0.3b5

Former-commit-id: 10e43bf85c32b791818a53bc871ceddc02a48a78

v0.3.3

13 Jan 15:09

Op fixes and performance optimizations

  • [enhancement][op] reduce sum half kernel #4110
  • [enhancement][op] simplify cosface #4107
  • [enhancement][op] indexed_slices update support weight_decay #4096
  • [enhancement][op][python] Migrate swish and mish namespace from math to nn #4104
  • [enhancement][op] Add elementwise maximum/minimum ops #4069
  • [enhancement][op] Fix Code format warning in hardswish #4105
  • [enhancement][feature][op] Add Scalar Pow #4082
  • [bug][op] Fix bug: mut_shape_view of static output maybe null in UserKernel::ForwardShape #4094
  • [enhancement][op][refactor] Migrate cast_to_static_shape to user op #4095
  • [feature][op] Add GroupNorm op #4089
  • [feature][op] Distributed partial sampler #3857
  • [enhancement][op][python] add Relu6 activation #4029
  • [bug][op] Rename ont_hot_op.cpp to one_hot_op.cpp #4093
  • [bug][op][python] fix hardtanh CI precision error #4091
  • [enhancement][op] add remove_img_without_anno api for COCOReader #4088
  • [enhancement][op] Add Hardtanh activation #4049
  • [enhancement][op] Add ELU activation #4065
  • [enhancement][op][python] Update logsoftmax.py #4041
  • [documentation][op] Fix in_top_k api document #4079
  • [enhancement][op] Add Hardswish activation #4059
  • [enhancement][op][python] Add hard sigmoid #4043
  • [enhancement][op] Dev in top k #3611
  • [bug][op] Fix argwhere tmp buffer infer #4061
  • [enhancement][op] Optimize softmax cuda kernel #4058
  • [feature][op] Add InstanceNorm 1d & 3d implementation #4052
  • [feature][op] Quantization aware training releated ops #3764
  • [enhancement][op] Generic unfold kernel implementation #4033
  • [enhancement][op] User op dim_gather support dynamic input and index #4039
  • [enhancement][op] Reflection pad2d op #3777
  • [enhancement][op] slice support empty blob #4025
  • [bug][enhancement][op] Migrate argwhere to user op #4021
  • [bug][op] Dev rm old tanh #4035
  • [enhancement][op][refactor] Make MaxWithLogThreshold and SafeLog header only #4030
  • [op][purge] Tidy up op_conf.proto #3932
  • [enhancement][op][python] Dev bcewithlogits loss #4024
  • [feature][op] Add implementation of InstanceNorm2D op #4020
  • [enhancement][op][refactor] Refactor gpu_atomic_add #4027
  • [enhancement][op][python] add kldivloss #4012
  • [enhancement][op][python] Dev oneflow ones #3990
  • [enhancement][op] Add flatten/squeeze/expand_dims to auto mixed precision clear list and use reshape instead of reshape_like to do reshape grad computation #4015
  • [enhancement][op][python] add pixel shuffle #4003
  • [enhancement][op] Scalar kernels use element-wise template #4013
  • [enhancement][op][python] add zeros api #3991
  • [enhancement][op] Optimize ComputeEntropyGpu with CUB #3930
  • [feature][op] CUDA template for element-wise kernels #4007

System components

  • [enhancement][system] migrate job_build_and_infer api to pybind11 #3940
  • [feature][system] quantization aware training pass #3817
  • [eager][enhancement][system] Mig op arg para attr #4102
  • [feature][system] Tensor Float 32 Support. #4072
  • [enhancement][system] Mig op arg para attr #4090
  • [enhancement][system] Mig py cfg sbp #4086
  • [enhancement][system] Refactor python remote blob #4081
  • [enhancement][system] remove BlobDef #4071
  • [bug][system] Fix warning: moving a local object in a return statement prevents copy elision #4067
  • [enhancement][system] Refactor python blob desc #4063
  • [feature][system] Add nvtx range and thread naming #4064
  • [documentation][enhancement][system] Add docs on installing legacy versions of oneflow #4056
  • [bug][system] support eager empty blob #4047
  • [enhancement][system] Add err info for ncclGroupEnd check #4048
  • [enhancement][system] Optimize dynamic loss scale parameters #4045
  • [purge][system] Remove col_id #4046
  • [enhancement][system] Scope with symbol #4040
  • [enhancement][system] Job desc with symbol #4032
  • [enhancement][system] Parallel desc with symbol #4017
  • [bug][system] change sbp order value for layer norm #3995
  • [bug][system] Fix eager test_resume_training test #4023
  • [bug][system] Fix python cfg error bug #4018
  • [bug][system] Remove redundant pack_size in GenericLauncher #4014
  • [enhancement][system] Set default block size to 512 #4011
  • [feature][system] Remove swig in oneflow #3969
  • [feature][system] Migrate oneflow internal api to pybind11 #3953
  • [build][enhancement][system] Bump nccl from 2.7.3 to 2.8.3 #3875

Eager mode

  • [bug][eager] Fix eager bug of test split like #4004
  • [bug][eager] add float16 datatype for eager boxing #4092

Python frontend

  • [feature][python] add stack #3897
  • [bug][enhancement][python] Fix test kldivloss tolerance #4103
  • [bug][enhancement][python] Fix "hardsigmoid" eager test error #4085
  • [bug][documentation][python] Add hardsigmoid #4076
  • [api][enhancement][python] add deprecate api optimizer.PolynomialSchduler #4038

Toolchain

  • [feature][tooling] split_cfg_cpp_and_pybind_generator #4002
  • [enhancement][tooling] Cfg hash #4084
  • [enhancement][tooling] Finetune cfg tool #4050
  • [enhancement][tooling] optimi...

v0.3.2

16 Dec 11:07

Changelog

v0.3.2 (16/12/2020)

  • [enhancement][system] Migrate foreigns to pybind11 #3939
  • [feature][op][python] add swish activation #3970
  • [bug][op] fix argwhere format #4010
  • [enhancement][op] Argwhere support empty blob #4009
  • [feature][op][python] add mish activation #3972
  • [bug][eager] Fix eager memory leak and re-enable new checkpoint #4008
  • [ci][enhancement] upload bin to oss #4000
  • [enhancement][op] Fuse cast scale #3999
  • [enhancement][op] layer_norm_grad_add_to_output #3998
  • [enhancement][system] Optimize NcclCollectiveBoxingExecutorBackend::ExecuteGroup latency #3997
  • [feature][system] OptimizerPlacementOptimization #3944
  • [enhancement][op] Dev optimize prelu #3987
  • [api][enhancement][op] Switch identity to user op and add it to auto mixed precision clear list #3992
  • [enhancement][op] Optimize slice kernel #3989
  • [bug][op] Hotfix: add parallel cast to amp clear list #3988
  • [bottleneck][enhancement][system] Sublinear memory cost by checkpointing #3976
  • [enhancement][system] Add gradients stats aggregation #3979
  • [feature][system] nccl enable mixed fusion #3981
  • [enhancement][op] fused_scale_tril / hot fix matmul / softmax broadcast_sub broadcast_div #3980
  • [bug][op] add combined margin cpu and fix bug #3961
  • [feature][op] Add multi_square_sum op #3977
  • [bug][op] fix pad op #3971
  • [ci][enhancement][test] larger tol for bn #3965
  • [cfg][enhancement][python] Dev replace py job conf proto to cfg #3856
  • [enhancement][refactor][ssp] Dev ssp fix fuse and add just #3959
  • [cfg][enhancement][refactor][tooling] replace ScopeProto to cfg #3816
  • [feature][op] TripOp add fill value #3960
  • [enhancement][system] remove serialized in python callback #3891

v0.3.1

02 Dec 08:59

Changelog

v0.3.1 (02/12/2020)

  • [bug][system] Fix CollectiveBoxingGenericTaskNode::ProduceAllRegstsAndBindEdges #3946
  • [bug][op] Fix constant init value #3947
  • [api][enhancement][refactor][tooling] Refine custom op build #3925
  • [feature][op] add combined margin loss #3819
  • [enhancement][tooling] default show cpp error stack frame #3948
  • [cfg][enhancement][tooling] Dev replace py parallel conf proto to cfg #3810
  • [feature][system] Add NaiveB2PSubTskGphBuilder #3942
  • [bug][system] disable new checkpoint by default temporarily #3943
  • [bug][system] Explicitly specify the SBP in NonDistributedOptimizerPass #3937
  • [bug][op] indexed_slices_model_update handle empty tensor #3933
  • [bug][ci] fix oss list file 100 limit #3935

v0.2.0

09 Oct 06:57

Changelog

v0.2.0 (09/10/2020)

Op fixes and performance optimizations

Support for fusing the binary add op with its predecessor node

  • FuseAddToOutput #3524
  • Dropout support add_to_output #3569
  • Dev matmul add to output #3581

Kernel performance optimizations

  • Fused BatchNormAddRelu #3519
  • bn_add_relu use bit mask #3645
  • layer_norm param grad #3604
  • Fused layer norm #3591
  • BiasAdd Row Col Half2 #3636
  • MaskAndScaleHalf2 #3643
  • Optimize CudaAsyncMemoryCopier #3543
  • Avoid using local memory in CropMirrorNormalizeGpuKernel #3539
  • LayerNormGpuKernel use fused InstanceScaleCenter #3573

Model update ops implemented as user ops, with fusion support for model update ops

  • Add model update user ops #3546
  • Migrate L1L2RegularizeGradientOp to UserOp Framework #3527
  • model update fuse scalar_mul_by_tensor #3635
  • Dev indexed slices model update user ops #3561
  • Dev adam xla and rm sys op #3584

NCCL supports setting the maximum number of fused ops

  • Add nccl_fusion_max_ops #3567

New ops

  • [feature] Fused ImageDecoderRandomCropResize #3644
  • Add AmpWhiteIdentityOp #3658
  • Add ImageDecoderRandomCropResizeOp::InferParallelSignature #3646
  • Dev add op tril #3511
  • add masked fill op #3515

cuDNN algorithm inference supports a global cache

  • Add CudnnConvAlgoCache #3649

Bug fixes and other changes

  • fix broadcast div grad #3525
  • fix optimizer copy-paste bug #3508
  • fix bug about pad value #3640
  • Optimize some default values #3648
  • Fix cuda runtime #3621
  • Fix reshape inplace #3545
  • Refactor rmsprop mean_square and add unit tests for optimizers #3523
  • Remove cuDNN fields from OperatorConf #3536
  • Add UserOpConfWrapperBuilder::ScopeSymbolId #3528
  • Fix NcclCollectiveBoxing builder_name #3563
  • rm conv2d cpu testcase #3574
  • fix broadcast_to_compatible_with grad bug #3609
  • Add inline for half #3600
  • Fix converter half #3599
  • Fix gpu_atomic_max double overload use fmaxf #3578
  • fix upsample #3579

Eager Execution

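Added more comments to the eager-related code; fine-tuned the stateless_call instruction to distinguish between the mutable_input and output parameter kinds; implemented the broadcast instruction.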
  • fix fmt cuda_copy_d2h_stream_type #3606
  • add comments for cuda_copy_d2h_stream_type.cpp #3603
  • Fix TopoForEachNode in GenCollectiveBoxingPlan #3566
  • Split call_op_kernel instruction args into const_input/mutable_input/output #3562
  • split BlobObject and EagerBlobObject #3485
  • remove unused code under vm/ #3585
  • Dev broadcast instruction #3555
  • Broadcast instruction #3552

pybind11 integration

SWIG and pybind11 currently coexist in OneFlow; the codebase will gradually switch to pybind11.
  • pybind11 integration #3517
  • upgrad to pybind11 master and pass exe path #3522
  • Update rel script for pybind11 #3526
  • Dev oneflow pybind api #3625

Build tooling optimizations and fixes

Fixed several unreasonable configurations that caused builds to fail or run slowly, sped up dependency downloads, and fixed the Ubuntu Dockerfile.
  • [bug] fix ubuntu docker build #3504
  • change link order to fix the cpu+openblas build #3634
  • [bug] fix bug: oneflow cpu-only lib flags #3615
  • add convert_url_to_oss_https_url and DCN flag #3595
  • Add cn url in readme #3583
  • make absl use tar not git #3570
  • Optimize nvcc gencode flag #3577

Transport network transfer subsystem

Supports P2P dynamic network transfer
  • [feature] Transport #3549

CFG tool integration

CFG is a tool based on proto syntax that generates code for exchanging data between Python and C++.
  • Dev integrate cfg #3597
  • Less usage of PbMessage in Operator #3651

XLA support improvements

Upgraded to the latest TF version
  • upgrade XRT XLA to TF 2.3.0 #3531
  • Fix XLA crash #3548

GRPC upgrade

Upgraded to the latest GRPC version
  • Upgrade grpc #3551
  • [bug] [bugfix] GRPC: control server CompletionQueue shutdown. #3589

CI and test improvements

Added XLA to CI, improved the op test cases, and set up automatic uploads of the latest master commit
  • Parallel unit tests (Step 1, refactor existing unit tests) #3632
  • Add build type for pr oss upload #3627
  • XLA ci support #3564
  • Auto upload tar to aliyun oss #3592
  • Don't pack source code if it is not master #3593
  • move fmt to github hosted #3559
  • refactor ci #3557
  • CtrlTest find available port for ctrl port instead of handwriting #3610

ONNX support

Optimized the IR and updated the test scripts

Added and revised documentation

Python frontend fixes

  • Fix the bug of using op_module_builder in namespace scope #3513
  • Comment release global for now to avoid random crash in python #3629
  • update lib name in link flags #3623
  • rm spaces in rm_spaces optimizer.py #3619

Optimizations and fixes for common system components

  • [enhancement] flat ErrorProto error_type #3474
  • [enhancement] Added user_op_conf getter for BatchAxisContext/KernelInitContext/SbpContext #3506
  • [bug] Fix UserOpConfWrapper::has_input/has_output #3507
  • support reflecting cfg message #3655
  • Refactor scope #3652
  • Refactor placement scope #3650
  • Bugfix split config proto and session job set #3637
  • [Bug fix] Release global variables #3624
  • Add OpRegistry::SetAreaId #3608
  • Dev converter #3580
  • Tensor::dptr support half #3582
  • Use InferOutBlobDescsIf instead of InferBlobDescsIf in InferOpNodeLogicalBlobDesc #3535
  • Add ctrl_in_op_name only when unreachable #3537

grpc0: change openssl, cares install path

13 Sep 10:34
Former-commit-id: 86c49d1163d3b822ca076201d8753d15cf895969

version 0.2b1

10 Sep 11:28
Pre-release
Former-commit-id: 349289677b3fcd6fd51dd488dda8bf270aa563bf

0.2 beta 0

08 Sep 04:08
Pre-release
fix version

Former-commit-id: 65069051da5b822fcc8510c0aaa0c8189e00b61d

0.1.11 beta1

07 Sep 14:28
Pre-release
upgrade XRT XLA to TF 2.3.0 (#3531)

* compile tf 2.3.0 with gcc 7.3

* fix oneflow eigen

* minor fix

* fix include

* update protobuf if xla is on

* update path of tf proto generated cpp files

* fix path in script

* add .clangd to git ignore

* update xla ifs

* update scripts

* update path in script for clangd

* add gitignore

* add cmake flag XRT_TF_URL

* rm comment

* check in changes

* bash tricks to enable gcc 7.3

* use arg to control tuna

* bumpversion

* fix build wheel

* use real path

* add dir for cpu

* fix unwanted yum update cublas

* uncomment all

* rm suffix of wheelhouse_dir

* add log info

Co-authored-by: tsai <caishenghang@1f-dev.kbaeegfb1x0ubnoznzequyxzve.bx.internal.cloudapp.net>
Co-authored-by: tsai <[email protected]>
Former-commit-id: da12e8db4f52d3c5351f0e43f3677dd948d3801d