## Overview

The Torch-TensorRT Dynamo effort is an ongoing project to optimize code with TensorRT using the novel `torch.compile` and `torch.export` APIs introduced with PyTorch 2.X. Beginning from RFC #1825 and the many subsequent Dynamo RFCs, the Torch-TensorRT Dynamo integration was structured similarly to the Torch-TensorRT TorchScript integration which preceded it. The components of the Dynamo integration were mostly designed from the ground up, using Torch utilities where possible to avoid code duplication, and mirroring much of the work done on the TorchScript path for development of converters for key operators. Below is an overview of the development of the Dynamo paths.

### Current Torch-TensorRT Dynamo Structure
## Graph Inputs

### torch.compile

`torch.compile` can take almost any callable as input, including `torch.fx.GraphModule` instances, plain functions, and `nn.Module` types. Input handling and parsing are handled automatically by the `torch.compile` context manager, and the first phase of lowering is done here by PyTorch. Specifically, graphs are formatted to have inputs and outputs which are sequences of Tensors (except in the dynamic shape case, where `torch.SymInt` can be an input, though we do not support this yet in the Torch-TensorRT backend). Additionally, complex control flow, `for`-loops, and Torch-unsupported Python syntax are handled automatically by `torch.compile` via the Guard mechanism, which intercepts un-traceable code and runs it in Python at runtime.
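For orientation, a minimal sketch of entering this path (assuming, as in current releases, that importing `torch_tensorrt` registers the `"torch_tensorrt"` backend string with Dynamo):

```python
import torch
import torch_tensorrt  # noqa: F401  (importing registers the Dynamo backend)

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
).eval().cuda()

inputs = [torch.randn(8, 128).cuda()]

# torch.compile parses the module; supported subgraphs are handed to the
# Torch-TensorRT backend, and un-traceable code falls back to Python via Guards.
optimized = torch.compile(model, backend="torch_tensorrt")
out = optimized(*inputs)
```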
## Decompositions [Lowering Phase 1]

Shared between the export and compile paths, we specify a set of decompositions which are applied by the ATen tracer (see the next section). The implementation originated directly from the `core_aten_decompositions` provided by Torch, which is an evolving set of decompositions of ATen operators into other lower-level operators. Due to the evolving nature of this set, we later opted to add the user-specifiable option `enable_experimental_decompositions`. This option has two settings:

- `enable_experimental_decompositions = True`: uses all of the `core_aten_decompositions` provided by PyTorch, minus the explicitly disabled set of decompositions, plus the set of custom decompositions written by the Torch-TRT team. The explicitly disabled set is listed in `TensorRT/py/torch_tensorrt/dynamo/lowering/_decomposition_groups.py`, lines 178 to 180 at `867dc7b`.
- `enable_experimental_decompositions = False`: uses a pre-selected set of decompositions, as implemented by Torch, plus the set of custom decompositions written by the Torch-TRT team. The explicitly enabled set is listed in `TensorRT/py/torch_tensorrt/dynamo/lowering/_decomposition_groups.py`, lines 13 to 177 at `867dc7b`.

The set of explicitly enabled decompositions is a list of ATen operators whose decompositions are obtained using the `torch._decomp.get_decompositions` function.
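For concreteness, a sketch of how such a list is materialized (the operator choices here are illustrative, not the actual enabled set):

```python
import torch
from torch._decomp import get_decompositions

aten = torch.ops.aten

# Fetch Torch's registered decompositions for a hand-picked list of ATen ops.
decompositions = get_decompositions(
    [
        aten.addmm,
        aten.gelu,
        aten.native_group_norm,
    ]
)

# The result maps each resolved OpOverload to the Python function
# implementing its decomposition into lower-level ops.
for op, fn in decompositions.items():
    print(op, "->", fn)
```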
## AOT ATen Lowering [Lowering Phase 2]

### torch.compile

Lowering to ATen IR is accomplished using the `aot_export_joint_simple` function, as in `TensorRT/py/torch_tensorrt/dynamo/backend/backends.py`, lines 78 to 86 at `867dc7b`. This function takes a `torch.compile`-parsed `GraphModule` and lowers it to the ATen IR set, including applying the above decompositions.
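As a rough, non-authoritative sketch of that call (invoking the private `torch._functorch` API directly; the argument names follow recent PyTorch 2.x releases and may differ across versions):

```python
import torch
from torch._decomp import get_decompositions
from torch._functorch.aot_autograd import aot_export_joint_simple

def fn(x: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x) + 1.0

sample_inputs = [torch.randn(4, 4)]

# Trace the callable down to an ATen-IR GraphModule, applying the requested
# decompositions along the way; trace_joint=False keeps it inference-only.
aten_gm = aot_export_joint_simple(
    fn,
    sample_inputs,
    trace_joint=False,
    decompositions=get_decompositions([torch.ops.aten.gelu]),
)
print(aten_gm.graph)
```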
## Graph Lowering Passes [Lowering Phase 3]

Occasionally, there are lowering passes which cannot be accomplished via a decomposition, for instance those requiring multi-operator replacement, subgraph rewriting, or constant folding. These lowering passes are completed in the third lowering phase; the ordered pass list is in `TensorRT/py/torch_tensorrt/dynamo/lowering/passes/_aten_lowering_pass.py`, lines 15 to 25 at `867dc7b`.

Each pass is structured to take in a `GraphModule` and a sequence of input Tensors, and to output a valid `GraphModule` which represents the graph after applying the lowering pass (a minimal sketch of this contract follows the list below). A brief overview of some key current passes:

- `constant_fold`: evaluates any operators containing graph constants and freezes them to graph parameters. (`TensorRT/py/torch_tensorrt/dynamo/lowering/passes/constant_folding.py`, line 25 at `867dc7b`)
- `lower_linear`: re-composes an operator (`aten.linear`) from its decomposed components. (`TensorRT/py/torch_tensorrt/dynamo/lowering/passes/lower_linear.py`, line 12 at `867dc7b`)
- `fuse_prims_broadcast`: fuses two operators into one, simpler operator. (`TensorRT/py/torch_tensorrt/dynamo/lowering/passes/fuse_prims_broadcast.py`, line 14 at `867dc7b`)
- `repair_input_as_output`: repairs graph cases where a graph input is also a graph output. (`TensorRT/py/torch_tensorrt/dynamo/lowering/passes/repair_input_as_output.py`, line 13 at `867dc7b`)
- `remove_input_alias_fixing_clones`: enables a workaround for an issue with `aot_export_joint_simple` where clones of input Tensors had to be inserted in the `GraphModule` for proper tracing. (`TensorRT/py/torch_tensorrt/dynamo/lowering/passes/remove_input_alias_fixing_clones.py`, line 13 at `867dc7b`)
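A minimal sketch of the pass contract described above; the `remove_detach` pass itself is hypothetical, shown only to illustrate the `GraphModule`-in, `GraphModule`-out structure:

```python
from typing import Sequence

import torch

def remove_detach(
    gm: torch.fx.GraphModule, sample_inputs: Sequence[torch.Tensor]
) -> torch.fx.GraphModule:
    """Hypothetical lowering pass: strip no-op aten.detach nodes."""
    for node in list(gm.graph.nodes):
        if node.op == "call_function" and node.target == torch.ops.aten.detach.default:
            # Rewire consumers to the detach input, then drop the node.
            node.replace_all_uses_with(node.args[0])
            gm.graph.erase_node(node)
    # Validate and regenerate the module code after the graph mutation.
    gm.graph.lint()
    gm.recompile()
    return gm
```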
## Partitioning

There are two options for partitioning in the Dynamo paths: the "fast partitioner" and the "global partitioner". At a very high level, the fast partitioner partitions quickly, using a greedy algorithm, but may return suboptimal partitions (extra segmentation) in certain cases. The global partitioner partitions slowly, using a global partition-fusing algorithm, but generally returns partitions which minimize segmentation in the graph. There are a few other minor differences between these partitioners as well.

- `use_fast_partitioner=True`: uses the `_SplitterBase` class from Torch. Returns a new `GraphModule` which has submodules `_run_on_acc_*` and `_run_on_gpu_*` for partitions to run in TensorRT or Torch, respectively. The partitions are built based on an operator support checker class. (`TensorRT/py/torch_tensorrt/dynamo/partitioning/_adjacency_partitioner.py`, lines 87 and 28 at `867dc7b`)
- `use_fast_partitioner=False`: uses the `CapabilityBasedPartitioner` class from Torch. Returns a new `GraphModule` which has submodules `fused_*` for partitions to run in TensorRT, and leaves nodes to run in Torch as-is. The partitions are likewise built based on an operator support checker class. (`TensorRT/py/torch_tensorrt/dynamo/partitioning/_global_partitioner.py`, lines 21 and 130 at `867dc7b`)
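A sketch of toggling this setting through the compile API (a non-authoritative example; `use_fast_partitioner` is assumed to default to `True`):

```python
import torch
import torch_tensorrt

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.ReLU(),
).eval().cuda()

# Compile with the global partitioner instead of the default fast one.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch.randn(1, 3, 224, 224).cuda()],
    use_fast_partitioner=False,
)

# With the fast partitioner, TRT-bound segments would instead appear as
# `_run_on_acc_*` submodules (and Torch fallbacks as `_run_on_gpu_*`).
for name, _ in trt_model.named_children():
    print(name)
```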
## Conversion

### Conversion Process

Conversion is enabled by the `TRTInterpreter` class, which runs the `torch.fx.GraphModule` line by line, converting each line to TensorRT via the converter implementations described below. This class selects the correct converter for each ATen operator and assigns input and output shapes and data types while building the TensorRT engine from scratch. (`TensorRT/py/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py`, line 48 at `867dc7b`)
The TensorRT engine itself has a few notable user-specifiable options: `precision` (or `enabled_precisions`), `max_aux_streams`, `version_compatible`, and `optimization_level`. The documentation for each is in `TensorRT/py/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py`, lines 137 to 151 at `867dc7b`.
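A sketch of setting these options at compile time (the values are illustrative, and option availability is assumed to match the linked docstrings):

```python
import torch
import torch_tensorrt

model = torch.nn.Linear(64, 32).eval().cuda()

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch.randn(8, 64).cuda()],
    enabled_precisions={torch.float16},  # permit FP16 engine kernels
    version_compatible=True,   # build a forward-compatible engine
    optimization_level=5,      # maximum builder effort (slower builds)
    max_aux_streams=2,         # cap auxiliary CUDA streams per engine
)
```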
### Converter Implementations

Converters translate each of the Torch ATen (or `prims`) operators to the TensorRT layers of the equivalent operation. All converters can be found in the file `py/torch_tensorrt/dynamo/conversion/aten_ops_converters.py`. Each function in that file represents the converter for one or multiple ATen ops; see `TensorRT/py/torch_tensorrt/dynamo/conversion/aten_ops_converters.py`, lines 57 to 89 at `867dc7b`, for an example of a converter. Each stacked call of `@dynamo_tensorrt_converter` represents an ATen operator which can be converted, and `@enforce_tensor_types` enforces constraints on the types of Tensors which can be taken as input for a particular argument in the converter.
A detailed discussion on how to write converters can be found here.
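For illustration, the general shape of a converter (a sketch only: the registry import path and the context parameter vary across versions, so treat them as assumptions and consult the linked file for the real pattern):

```python
import tensorrt as trt
import torch
from torch.fx.node import Target

# Assumed import path for the registration decorator -- verify against
# aten_ops_converters.py at the commit above.
from torch_tensorrt.dynamo.conversion._ConverterRegistry import (
    dynamo_tensorrt_converter,
)


@dynamo_tensorrt_converter(torch.ops.aten.relu.default)
def aten_ops_relu(
    ctx,             # conversion context wrapping the in-progress TRT network
    target: Target,  # the ATen op overload being converted
    args: tuple,     # positional arguments of the FX node (here: the input)
    kwargs: dict,    # keyword arguments of the FX node
    name: str,       # unique name to assign the generated TRT layer(s)
):
    # Add the equivalent TensorRT layer and return its output tensor.
    layer = ctx.net.add_activation(args[0], trt.ActivationType.RELU)
    layer.name = name
    return layer.get_output(0)
```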
## Runtime

There are two runtime options in the Dynamo paths: the Python runtime and the C++ runtime. The C++ runtime is suggested, since it can be used in serialization for export and is a more direct interface with the TensorRT API. The Python runtime is for use when the C++ dependency is not present.

- `use_python_runtime = None`: automatically selects the runtime based on the presence of the `torch_tensorrt` C++ package, with a preference for the C++ runtime.
- `use_python_runtime = True`: uses the Python runtime to run inference. (`TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py`, line 14 at `867dc7b`)
- `use_python_runtime = False`: uses the C++ runtime to run inference. (`TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py`, line 19 at `867dc7b`)
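As with the options above, the runtime is selected per compilation (sketch; the default is assumed to be `None`, i.e. auto-detect):

```python
import torch
import torch_tensorrt

model = torch.nn.Linear(64, 32).eval().cuda()

# Force the Python runtime, e.g. when the C++ runtime library is unavailable.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch.randn(8, 64).cuda()],
    use_python_runtime=True,
)
```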