## Overview

The Torch-TensorRT Dynamo effort is an ongoing project to optimize code with TensorRT using the novel `torch.compile` and `torch.export` APIs introduced with PyTorch 2.X. Beginning from RFC #1825 and the many subsequent Dynamo RFCs, the Torch-TensorRT Dynamo integration was structured similarly to the Torch-TensorRT TorchScript integration which preceded it. The components of the Dynamo integration were mostly designed from the ground up, using Torch utilities where possible to avoid code duplication, and mirroring much of the work done on the TorchScript path for development of converters for key operators. Below is an overview of the development of the Dynamo paths.

### Current Torch-TensorRT Dynamo Structure
## Graph Inputs

### torch.compile

`torch.compile` can take almost any callable as input, including `torch.fx.GraphModule` instances, plain functions, and `nn.Module` types. Input handling and parsing are handled automatically by the `torch.compile` context manager, and the first phase of lowering is done here by PyTorch. Specifically, graphs are formatted to have inputs and outputs which are sequences of Tensors (except in the dynamic shape case, where `torch.SymInt` can be an input, though we do not support this yet in the Torch-TensorRT backend). Additionally, complex control flow, `for`-loops, and Torch-unsupported Python syntax are handled automatically by `torch.compile` via the Guard mechanism, which intercepts un-traceable code and runs it in Python at runtime.
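For orientation, a minimal sketch of entering this path (assuming, as in current releases, that importing `torch_tensorrt` registers the `"torch_tensorrt"` backend string with Dynamo):

```python
import torch
import torch_tensorrt  # noqa: F401  (importing registers the Dynamo backend)

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
).eval().cuda()

inputs = [torch.randn(8, 128).cuda()]

# torch.compile parses the module; supported subgraphs are handed to the
# Torch-TensorRT backend, and un-traceable code falls back to Python via Guards.
optimized = torch.compile(model, backend="torch_tensorrt")
out = optimized(*inputs)
```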
## Decompositions [Lowering Phase 1]

Shared between the export and compile paths, we specify a set of decompositions which are applied by the ATen tracer (see the next section). The implementation originated directly from the `core_aten_decompositions` provided by Torch, which is an evolving set of decompositions of ATen operators into other lower-level operators. Due to the evolving nature of this set, we later opted to add the user-specifiable option `enable_experimental_decompositions`. This option has two settings:

- `enable_experimental_decompositions = True`: uses all of the `core_aten_decompositions` provided by PyTorch, minus the explicitly disabled set of decompositions, plus the set of custom decompositions written by the Torch-TRT team. The explicitly disabled set is listed in `TensorRT/py/torch_tensorrt/dynamo/lowering/_decomposition_groups.py`, lines 178 to 180 at `867dc7b`.
- `enable_experimental_decompositions = False`: uses a pre-selected set of decompositions, as implemented by Torch, plus the set of custom decompositions written by the Torch-TRT team. The explicitly enabled set is listed in `TensorRT/py/torch_tensorrt/dynamo/lowering/_decomposition_groups.py`, lines 13 to 177 at `867dc7b`.

The set of explicitly enabled decompositions is a list of ATen operators whose decompositions are obtained using the `torch._decomp.get_decompositions` function.
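For concreteness, a sketch of how such a list is materialized (the operator choices here are illustrative, not the actual enabled set):

```python
import torch
from torch._decomp import get_decompositions

aten = torch.ops.aten

# Fetch Torch's registered decompositions for a hand-picked list of ATen ops.
decompositions = get_decompositions(
    [
        aten.addmm,
        aten.gelu,
        aten.native_group_norm,
    ]
)

# The result maps each resolved OpOverload to the Python function
# implementing its decomposition into lower-level ops.
for op, fn in decompositions.items():
    print(op, "->", fn)
```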
## AOT ATen Lowering [Lowering Phase 2]

### torch.compile

Lowering to ATen IR is accomplished using the `aot_export_joint_simple` function, as in `TensorRT/py/torch_tensorrt/dynamo/backend/backends.py`, lines 78 to 86 at `867dc7b`. This function takes a `torch.compile`-parsed `GraphModule` and lowers it to the ATen IR set, including applying the above decompositions.
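As a rough, non-authoritative sketch of that call (invoking the private `torch._functorch` API directly; the argument names follow recent PyTorch 2.x releases and may differ across versions):

```python
import torch
from torch._decomp import get_decompositions
from torch._functorch.aot_autograd import aot_export_joint_simple

def fn(x: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x) + 1.0

sample_inputs = [torch.randn(4, 4)]

# Trace the callable down to an ATen-IR GraphModule, applying the requested
# decompositions along the way; trace_joint=False keeps it inference-only.
aten_gm = aot_export_joint_simple(
    fn,
    sample_inputs,
    trace_joint=False,
    decompositions=get_decompositions([torch.ops.aten.gelu]),
)
print(aten_gm.graph)
```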
## Graph Lowering Passes [Lowering Phase 3]

Occasionally, there are lowering passes which cannot be accomplished via a decomposition, for instance those requiring multi-operator replacement, subgraph rewriting, or constant folding. These lowering passes are completed in the third lowering phase; the ordered pass list is in `TensorRT/py/torch_tensorrt/dynamo/lowering/passes/_aten_lowering_pass.py`, lines 15 to 25 at `867dc7b`.

Each pass is structured to take in a `GraphModule` and a sequence of input Tensors, and to output a valid `GraphModule` which represents the graph after applying the lowering pass (a minimal sketch of this contract follows the list below). A brief overview of some key current passes:

- `constant_fold`: evaluates any operators containing graph constants and freezes them to graph parameters. (`TensorRT/py/torch_tensorrt/dynamo/lowering/passes/constant_folding.py`, line 25 at `867dc7b`)
- `lower_linear`: re-composes an operator (`aten.linear`) from its decomposed components. (`TensorRT/py/torch_tensorrt/dynamo/lowering/passes/lower_linear.py`, line 12 at `867dc7b`)
- `fuse_prims_broadcast`: fuses two operators into one, simpler operator. (`TensorRT/py/torch_tensorrt/dynamo/lowering/passes/fuse_prims_broadcast.py`, line 14 at `867dc7b`)
- `repair_input_as_output`: repairs graph cases where a graph input is also a graph output. (`TensorRT/py/torch_tensorrt/dynamo/lowering/passes/repair_input_as_output.py`, line 13 at `867dc7b`)
- `remove_input_alias_fixing_clones`: enables a workaround for an issue with `aot_export_joint_simple` where clones of input Tensors had to be inserted in the `GraphModule` for proper tracing. (`TensorRT/py/torch_tensorrt/dynamo/lowering/passes/remove_input_alias_fixing_clones.py`, line 13 at `867dc7b`)
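A minimal sketch of the pass contract described above; the `remove_detach` pass itself is hypothetical, shown only to illustrate the `GraphModule`-in, `GraphModule`-out structure:

```python
from typing import Sequence

import torch

def remove_detach(
    gm: torch.fx.GraphModule, sample_inputs: Sequence[torch.Tensor]
) -> torch.fx.GraphModule:
    """Hypothetical lowering pass: strip no-op aten.detach nodes."""
    for node in list(gm.graph.nodes):
        if node.op == "call_function" and node.target == torch.ops.aten.detach.default:
            # Rewire consumers to the detach input, then drop the node.
            node.replace_all_uses_with(node.args[0])
            gm.graph.erase_node(node)
    # Validate and regenerate the module code after the graph mutation.
    gm.graph.lint()
    gm.recompile()
    return gm
```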
## Partitioning

There are two options for partitioning in the Dynamo paths: the "fast partitioner" and the "global partitioner". At a very high level, the fast partitioner partitions quickly, using a greedy algorithm, but may return suboptimal partitions (extra segmentation) in certain cases. The global partitioner partitions slowly, using a global partition-fusing algorithm, but generally returns partitions which minimize segmentation in the graph. There are a few other minor differences between these partitioners as well.

- `use_fast_partitioner=True`: uses the `_SplitterBase` class from Torch. Returns a new `GraphModule` which has submodules `_run_on_acc_*` and `_run_on_gpu_*` for partitions to run in TensorRT or Torch, respectively. The partitions are built based on an operator support checker class. (`TensorRT/py/torch_tensorrt/dynamo/partitioning/_adjacency_partitioner.py`, lines 87 and 28 at `867dc7b`)
- `use_fast_partitioner=False`: uses the `CapabilityBasedPartitioner` class from Torch. Returns a new `GraphModule` which has submodules `fused_*` for partitions to run in TensorRT, and leaves nodes to run in Torch as-is. The partitions are likewise built based on an operator support checker class. (`TensorRT/py/torch_tensorrt/dynamo/partitioning/_global_partitioner.py`, lines 21 and 130 at `867dc7b`)
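A sketch of toggling this setting through the compile API (a non-authoritative example; `use_fast_partitioner` is assumed to default to `True`):

```python
import torch
import torch_tensorrt

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.ReLU(),
).eval().cuda()

# Compile with the global partitioner instead of the default fast one.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch.randn(1, 3, 224, 224).cuda()],
    use_fast_partitioner=False,
)

# With the fast partitioner, TRT-bound segments would instead appear as
# `_run_on_acc_*` submodules (and Torch fallbacks as `_run_on_gpu_*`).
for name, _ in trt_model.named_children():
    print(name)
```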
## Conversion

### Conversion Process

Conversion is enabled by the `TRTInterpreter` class, which runs the `torch.fx.GraphModule` line by line, converting each line to TensorRT via the converter implementations described below. This class selects the correct converter for each ATen operator and assigns input and output shapes and data types while building the TensorRT engine from scratch. (`TensorRT/py/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py`, line 48 at `867dc7b`)
The TensorRT engine itself has a few notable user-specifiable options: `precision` (or `enabled_precisions`), `max_aux_streams`, `version_compatible`, and `optimization_level`. The documentation for each is in `TensorRT/py/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py`, lines 137 to 151 at `867dc7b`.
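A sketch of setting these options at compile time (the values are illustrative, and option availability is assumed to match the linked docstrings):

```python
import torch
import torch_tensorrt

model = torch.nn.Linear(64, 32).eval().cuda()

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch.randn(8, 64).cuda()],
    enabled_precisions={torch.float16},  # permit FP16 engine kernels
    version_compatible=True,   # build a forward-compatible engine
    optimization_level=5,      # maximum builder effort (slower builds)
    max_aux_streams=2,         # cap auxiliary CUDA streams per engine
)
```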
### Converter Implementations

Converters translate each of the Torch ATen (or `prims`) operators to the TensorRT layers of the equivalent operation. All converters can be found in the file `py/torch_tensorrt/dynamo/conversion/aten_ops_converters.py`. Each function in that file represents the converter for one or multiple ATen ops; see `TensorRT/py/torch_tensorrt/dynamo/conversion/aten_ops_converters.py`, lines 57 to 89 at `867dc7b`, for an example of a converter. Each stacked call of `@dynamo_tensorrt_converter` represents an ATen operator which can be converted, and `@enforce_tensor_types` enforces constraints on the types of Tensors which can be taken as input for a particular argument in the converter.
A detailed discussion on how to write converters can be found here.
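For illustration, the general shape of a converter (a sketch only: the registry import path and the context parameter vary across versions, so treat them as assumptions and consult the linked file for the real pattern):

```python
import tensorrt as trt
import torch
from torch.fx.node import Target

# Assumed import path for the registration decorator -- verify against
# aten_ops_converters.py at the commit above.
from torch_tensorrt.dynamo.conversion._ConverterRegistry import (
    dynamo_tensorrt_converter,
)


@dynamo_tensorrt_converter(torch.ops.aten.relu.default)
def aten_ops_relu(
    ctx,             # conversion context wrapping the in-progress TRT network
    target: Target,  # the ATen op overload being converted
    args: tuple,     # positional arguments of the FX node (here: the input)
    kwargs: dict,    # keyword arguments of the FX node
    name: str,       # unique name to assign the generated TRT layer(s)
):
    # Add the equivalent TensorRT layer and return its output tensor.
    layer = ctx.net.add_activation(args[0], trt.ActivationType.RELU)
    layer.name = name
    return layer.get_output(0)
```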
## Runtime

There are two runtime options in the Dynamo paths: the Python runtime and the C++ runtime. The C++ runtime is suggested, since it can be used in serialization for export and is a more direct interface with the TensorRT API. The Python runtime is for use when the C++ dependency is not present.

- `use_python_runtime = None`: automatically selects the runtime based on the presence of the `torch_tensorrt` C++ package, with a preference for the C++ runtime.
- `use_python_runtime = True`: uses the Python runtime to run inference. (`TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py`, line 14 at `867dc7b`)
- `use_python_runtime = False`: uses the C++ runtime to run inference. (`TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py`, line 19 at `867dc7b`)
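As with the options above, the runtime is selected per compilation (sketch; the default is assumed to be `None`, i.e. auto-detect):

```python
import torch
import torch_tensorrt

model = torch.nn.Linear(64, 32).eval().cuda()

# Force the Python runtime, e.g. when the C++ runtime library is unavailable.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch.randn(8, 64).cuda()],
    use_python_runtime=True,
)
```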