Changelog
OneFlow has released version 0.3.2. Both this release and the preceding 0.3.1 are minor releases under the 0.3.0 major version, so they are covered together here.
This release brings a large number of performance optimizations, adds many new features, and is among the first to support CUDA 11.1.
Overview of major new features
Support for sublinear memory optimization
Enabled via `flow.experimental.scope.config(checkpointing=self.checkpoint_activations)`, this can save a large amount of memory. For example:

```python
def transformer_layer(self, name, x, *, past):
    # ...
    with flow.scope.namespace(name):
        x = flow.identity(x)
        with flow.experimental.scope.config(
            checkpointing=self.checkpoint_activations
        ):
            norm1 = norm(x, name="layernorm_1")
            # ...
```
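For context, a self-contained sketch of how the checkpointing scope might sit inside a full training job under the legacy lazy-mode API; the layer sizes, names, and optimizer setup are illustrative assumptions, not taken from this release note:

```python
import oneflow as flow
import oneflow.typing as tp

@flow.global_function(type="train")
def train_job(x: tp.Numpy.Placeholder((8, 512), dtype=flow.float32)) -> tp.Numpy:
    with flow.scope.namespace("block"):
        # Activations produced inside this scope are recomputed during the
        # backward pass instead of being kept in device memory.
        with flow.experimental.scope.config(checkpointing=True):
            h = flow.layers.dense(x, 512, activation=flow.math.relu, name="dense1")
            h = flow.layers.dense(h, 512, activation=flow.math.relu, name="dense2")
    loss = flow.math.reduce_mean(h)
    flow.optimizer.SGD(
        flow.optimizer.PiecewiseConstantScheduler([], [1e-3]), momentum=0
    ).minimize(loss)
    return loss
```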
New checkpoint
The new checkpoint mechanism is much more flexible: it supports partial load/save, reading the values of weights (e.g. for printing), and assigning weights from numpy arrays.

```python
with tempfile.TemporaryDirectory() as save_dir:
    refresh_session()
    large1 = get_checkpoint_ready_model(model_getter, dtype)
    flow.checkpoint.save(save_dir)
    res1 = large1()

    refresh_session()
    large2 = get_checkpoint_ready_model(model_getter, dtype)
    vars_in_file = flow.checkpoint.get(save_dir)
    flow.load_variables(vars_in_file)
    res2 = large2()

refresh_session()
model = get_checkpoint_ready_model(get_add_and_reduce_mean_model, dtype)
var_x = flow.get_all_variables()["x"]
var_y_value_before_loading = flow.get_all_variables()["y"].numpy()
new_val_np = np.random.random(var_x.shape).astype(np.float32)
flow.load_variables({"x": new_val_np})
var_y_value_after_loading = flow.get_all_variables()["y"].numpy()
flow_res = model()
```
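Distilled from the test snippet above, a minimal sketch of the typical workflow; the save directory and the variable name "x" are illustrative, while the calls themselves are the ones shown above:

```python
import numpy as np
import oneflow as flow

flow.checkpoint.save("./model_save_dir")            # save all variables of the current job
all_vars = flow.checkpoint.get("./model_save_dir")  # read saved variables back from disk
flow.load_variables(all_vars)                       # assign them to the job; a subset also works

live_vars = flow.get_all_variables()                # inspect live weight values
print(live_vars["x"].numpy())
flow.load_variables({"x": np.zeros(live_vars["x"].shape, dtype=np.float32)})  # assign from numpy
```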
Support for dynamic loss scale schedule
Enable it as follows:

```python
loss_scale_policy = flow.optimizer.loss_scale.dynamic_loss_scale(increment_period=2000)
optimizer = flow.optimizer.AdamW(..., loss_scale_policy=loss_scale_policy)
```
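As a rough sketch of where this sits in a mixed-precision training job under the legacy lazy-mode API; the AMP flag, learning-rate scheduler, and weight_decay value are assumptions for illustration, not part of this release note:

```python
import oneflow as flow
import oneflow.typing as tp

func_config = flow.function_config()
func_config.enable_auto_mixed_precision(True)  # assumed: dynamic loss scaling is typically paired with AMP

@flow.global_function(type="train", function_config=func_config)
def train_job(x: tp.Numpy.Placeholder((8, 128), dtype=flow.float32)) -> tp.Numpy:
    out = flow.layers.dense(x, 128, name="fc")
    loss = flow.math.reduce_mean(out)
    loss_scale_policy = flow.optimizer.loss_scale.dynamic_loss_scale(increment_period=2000)
    flow.optimizer.AdamW(
        flow.optimizer.PiecewiseConstantScheduler([], [1e-3]),  # illustrative scheduler
        weight_decay=0.01,                                       # illustrative value
        loss_scale_policy=loss_scale_policy,
    ).minimize(loss)
    return loss
```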
Support for the latest CUDA 11.1
Install with:

```bash
python3 -m pip install --find-links https://release.oneflow.info oneflow_cu111 --user
```
Prebuilt packages with the XLA tensor compiler (supporting CUDA 10.0, 10.1, 10.2, and 11.0)
Install with:

```bash
python3 -m pip install --find-links https://release.oneflow.info oneflow_cu101_xla --user
```
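After installing an XLA-enabled package, XLA can be switched on per job function. A minimal sketch, assuming the legacy `use_xla_jit` switch on `flow.function_config()`; that switch comes from the XLA integration and is not stated in this release note:

```python
import oneflow as flow
import oneflow.typing as tp

config = flow.function_config()
config.use_xla_jit(True)  # assumed switch: compile this job function with the XLA JIT

@flow.global_function(function_config=config)
def predict_job(x: tp.Numpy.Placeholder((1, 64), dtype=flow.float32)) -> tp.Numpy:
    return flow.layers.dense(x, 10, name="fc")
```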
Major improvements and bug fixes
Changelog v0.3.0 ~ v0.3.2 (16/12/2020)
Op fixes and optimizations
Optimized scalar mul by tensor, cast scale, prelu, fused_scale_tril, and other ops and op combinations.
- [enhancement][op] Dev sx xla clip #3656
 - [enhancement][op] Add UserOp::InferSbpSignature #3699
 - [bug][op] Fix fuse scalar mul by tensor sbp #3692
 - [bug][op] fix softmax condition #3675
 - [enhancement][op] slice_update op #3544
 - [enhancement][op] optimize rmsprop and lars optimizers #3809
 - [enhancement][op] add oneflow_range #3725
 - [enhancement][op] torch.gather #3602
 - [bug][op] skip conv2d padding dynamic test case #3813
 - [bug][op] Fix __hne in BinaryFuncFloorMod #3788
 - [bug][op] Fix bn[_add]_relu test case #3767
 - [enhancement][op][system] Make class Tensor abstract #3757
 - [enhancement][op] Add user_op::KernelCreateContext #3739
 - [bug][op] fix warning #3732
 - [api][enhancement][op] User op registry attr #3716
 - [enhancement][op][refactor] Dev refactor user op registry attr #3714
 - [bug][op] fix argwhere format #4010
 - [enhancement][op] Argwhere support empty blob #4009
 - [enhancement][op] Fuse cast scale #3999
 - [enhancement][op] layer_norm_grad_add_to_output #3998
 - [enhancement][op] Dev optimize prelu #3987
 - [api][enhancement][op] Switch identity to user op and add it to auto mixed precision clear list #3992
 - [enhancement][op] Optimize slice kernel #3989
 - [bug][op] Hotfix: add parallel cast to amp clear list #3988
 - [enhancement][op] fused_scale_tril / hot fix matmul / softmax broadcast_sub broadcast_div #3980
 - [bug][op] add combined margin cpu and fix bug #3961
 - [bug][op] fix pad op #3971
 - [bug][op] Fix constant init value #3947
 - [bug][op] indexed_slices_model_update handle empty tensor #3933
 - [bug][op] fix distribute_clone sbp #3803
 - [bug][op] Reshape backward issue with distribute split #3915
 - [enhancement][op] Remove NormalModelUpdateOpConf #3917
 - [enhancement][op] Dev unsorted segment sum #3731
 - [bug][op] Dev split like add backward #3901
 - [bug][op] distribute concat out dynamic false #3899
 - [enhancement][op] UserOpWrapper add HasGradTensor4OpOutput #3904
 - [enhancement][op] Unpack/Pack user op #3727
 - [enhancement][op] adam_bias_correction_learning_rate #3763
 - [enhancement][op][serving] add flatten op implementation #3789
 - [enhancement][op] Dev enhance sort ops #3828
 - [enhancement][op] Optimize softmax cuda kernel block size #3853
 - [enhancement][op] SplitLikeOp prefix support #3866
 - [bug][op] fix gather set_is_dynamic #3900
 - [bug][op] fix unsorted segment sum like #3898
 
New ops and new features for existing ops
Added ops including polyval, swish, mish, multi_square_sum, mseloss, lamb, and triplet loss.
- [enhancement][op] Add polyval op #3541
 - [feature][op] Add broadcast like backward #3665
 - [feature][op] Add cuda_pseudo_half.h #3669
 - [feature][op][python] add swish activation #3970
 - [feature][op][python] add mish activation #3972
 - [feature][op] Add multi_square_sum op #3977
 - [feature][op] TripOp add fill value #3960
 - [feature][op] add combined margin loss #3819
 - [feature][op] dynamic loss scale schedule op #3885
 - [feature][op][python] add mseloss #3893
 - [feature][op] LAMB support #3620
 - [feature][op] logical slice_assign and slice op #3647
 - [feature][op][system] Add Repeat/Acc user op #3707
 - [feature][op][ssp] Ssp variable proxy #3715
 - [feature][op] multi_count_not_finite op #3879
 - [feature][op] model update op add skip if #3883
 - [feature][python] Add triplet loss #3864
 
System components
OneFlow Collective Boxing now supports NCCL All2All, and the codebase can be compiled against CUDA 11.1.
- [feature][system] Add Nccl All2All #3538
 - [WIP][bug][system] Add attribute “batch_axis_non_change” to oneflow.transpose #3685
 - [bug][system] fix memcopy #3687
 - [documentation][enhancement][system] change url link of api docs #3677
 - [enhancement][system] Op collection #3833
 - [bug][system] fix pybind11 include #3876
 - [enhancement][system] Dev replace str to cfg obj in python callback #3832
 - [enhancement][system] Dev cpp instructions builder #3829
 - [enhancement][system] Dev forward declare cfg #3808
 - [bug][system] Fix CUDA 11.1 compiler crashes #3795
 - [bug][system] Bakcport bug fixes for distributed run from multi node ci #3765
 - [bug][system] Fix handle remote regst #3761
 - [enhancement][system] Refactor ExecKernel::bn_in_op2regst_desc_id to bn_in_op2blob_info #3744
 - [enhancement][system] Dev scope attr value #3756
 - [enhancement][system] rename UserOpAttrVal to AttrValue #3752
 - [enhancement][system] refactor OpGraphPass to JobPass #3745
 - [enhancement][system] RtRegst/Regst GetBlobDesc/BlobByOrdinal #3737
 - [enhancement][system] Log WARNING to stderr #3713
 - [enhancement][system] Use cudaMemcpyDefault #3700
 - [enhancement][system] Migrate foreigns to pybind11 #3939
 - [enhancement][system] Optimize NcclCollectiveBoxingExecutorBackend::ExecuteGroup latency #3997
 - [feature][system] OptimizerPlacementOptimization #3944
 - [feature][system] New checkpoint #3540
 - [enhancement][system] Sublinear memory cost by checkpointing #3976
 - [enhancement][system] Add gradients stats aggregation #3979
 - [feature][system] nccl enable mixed fusion #3981
 - [enhancement][system] remove serialized in python callback #3891
 - [bug][system] Fix CollectiveBoxingGenericTaskNode::ProduceAllRegstsAndBindEdges #3946
 - [feature][system] Add NaiveB2PSubTskGphBuilder #3942
 - [bug][system] disable new checkpoint by default temporarily #3943
 - [bug][system] Explicitly specify the SBP in NonDistributedOptimizerPass #3937
 - [enhancement][system] Add ssp variable proxy #3859
 - [cfg][enhancement][system] Dev switch error proto with cfg error proto #3858
 - [enhancement][refactor][system] New Chain #3874
 - [feature][system] DynamicLossScale #3886
 - [bug][system] Remove CheckNoCycle in chain graph #3693
 - [feature][ssp][system] Memory Reuse support time shape > meta shape #3796
 - [feature][system] OneFlow support tensor shape max dim size up to 6 #3802
 - [bug][enhancement][system] Support Ampere devices #3806
 - [enhancement][system] Simple kernel memory bandwidth profiler #3855
 
Eager mode
Fixed a series of bugs
- [bug][eager] Use universal start global device id for all streams #3701
 - [bug][eager] Ci add eager #3672
 - [bug][eager] Fix eager mode bug #3681
 - [eager][feature] Eager transport #3598
 - [eager][enhancement][python][refactor] rm scope_proto symbol_id #3865
 - [cfg][eager][enhancement] Replace py instruction to CFG Instruction #3773
 - [eager][enhancement][refactor] refactor ParallelDescSymbol #3774
 - [eager][feature] use proxy blob_object for boxing, add some inter-node boxing #3711
 - [bug][eager] fix unpacked mirrored blob object shape #3703
 - [bug][eager] Fix eager memory leak and re-enable new checkpoint #4008
 - [bug][eager] barrier for multi node eager #3748
 
Python frontend
- [api][documentation][python] Dev add api rst #3695
 - [feature][python][refactor] add check in deconv #3835
 - [bug][enhancement][python] fix stirng format in py35 #3878
 - [bug][python] fix exception in BlobObject del #3742
 - [bug][python] make float/double as aliases of float32/float64 #3740
 - [api][bug][documentation][python] Fix placement api doc #3638
 - [cfg][enhancement][python] Dev replace py job conf proto to cfg #3856
 - [feature][python] add bceloss #3804
 - [enhancement][feature][python] add l1 loss op in python #3793
 
Toolchain
More SWIG interfaces replaced with pybind11
- [documentation][tooling] Add api docs zzk #3680
 - [documentation][tooling] Add api docs zzk #3587
 - [cfg][enhancement][tooling] Cfg template operator reform #3861
 - [cfg][enhancement][tooling] Dev use union instead of struct for oneof #3870
 - [cfg][enhancement][tooling] Sort cfg obj forward declare #3844
 - [enhancement][tooling] Dev move run instruction to pybind #3775
 - [bug][cfg][tooling] fix cfg module load error bug #3815
 - [bug][tooling] Fix oneflow worker launch in py35 #3778
 - [bug][cfg][tooling] Fix cfg sub proto mudule process bug #3729
 - [enhancement][tooling] Dev data onerec #3104
 - [cfg][enhancement][tooling] Dev compare cfg file #3717
 - [bug][tooling] remove proton not related to Instruction #3708
 - [bug][cfg][tooling] Dev switch instruction to cfg instruction #3702
 - [cfg][enhancement][refactor][tooling] replace ScopeProto to cfg #3816
 - [api][enhancement][refactor][tooling] Refine custom op build #3925
 - [enhancement][tooling] default show cpp error stack frame #3948
 - [cfg][enhancement][tooling] Dev replace py parallel conf proto to cfg #3810
 - [cfg][enhancement][tooling] optimize cfg generator to save time #3906
 - [enhancement][feature][tooling] Py kernel2 #3686
 
Build
Fixed NVCC flags, fixed CMake setting the wrong C++11 ABI environment variable under RedHat GCC, fixed `make -j` issues that could occur during compilation, and fixed the include directory going missing when building manually.
- [build][documentation] fix readme #3694
 - [bug][build] fix missing symbol when load so #3676
 - [bug][build] Fix CUDA_NVCC_GENCODES #3869
 - [build][documentation] Add info in readme about how to build oneflow in docker #3781
 - [build][ci][enhancement] Add bazel_cache dir for XLA build #3766
 - [bug][build] fix ubuntu build relocation R_X86_64_PC32 against symbol error #3754
 - [build][ci][enhancement] Refactor build script #3698
 - [bug][build] fix make -j in grpc and openssl #3724
 - [bug][build] detect cxx11 abi availibility in cmake #3709
 - [bug][build] fix include files not copied #3907
 
CI
Improved speed and stability, and added support for distributed environments
- [bug][ci] test use uuid log dir #3689
 - [ci][enhancement] Run check_license_and_format in every branch #3683
 - [ci][feature][test] Parallel run op cases #3670
 - [ci][enhancement] Run xla and pure cpu only when cuda test succeeds #3679
 - [ci][documentation][enhancement] add requirements.txt for api-docs #3671
 - [ci][enhancement] ci add label check workflow #3664
 - [ci][enhancement] CI merge all jobs into one #3868
 - [ci][enhancement] Check label every push #3863
 - [ci][enhancement] Update hard coded host affiliations #3847
 - [ci][enhancement] External PR skip oss steps #3843
 - [ci][enhancement] ci use pull_request ev #3842
 - [ci][enhancement] ci only use pull_request_target #3840
 - [ci][enhancement] Add pull_request_target to allow forks access secrets when CI triggerd #3837
 - [ci][enhancement] CI run when bot is requested review #3831
 - [ci][enhancement] Prevent CI failure #3830
 - [ci][enhancement] ci dont test 2n8c #3786
 - [ci][enhancement] upload bin to oss #4000
 - [ci][enhancement][test] larger tol for bn #3965
 - [bug][ci] fix oss list file 100 limit #3935
 - [ci][enhancement] Refine release oss url #3924
 - [ci][enhancement] Build master whl once a day #3894
 - [ci][feature] Multi node support in CI #3735
 
Test
Fixed the image resize test case