SMAUG
Simulating Machine Learning Applications on gem5-Aladdin
This page describes how the SMAUG tiling optimizers work. The idea of the SMAUG tiling system is to provide a basic scaffolding on which end users can write new tiling optimizers for new operators, while also allowing users to depart entirely from this scaffolding and design a fully custom system if they desire. SMAUG's design is just one point in the design space and does not necessarily represent the "optimal" one: how to tile an operation so as to maximize data reuse and minimize data movement is an area of active research.
If you are looking for a tutorial that walks through writing a very simple tiling optimizer, see Tile your Tensors. This page is designed to show you how to use SMAUG APIs to write tiling optimizers for more complex operators.
For convenience, this section uses examples of tiling on convolutions, but the design applies to all operators.
The SMAUG tiling optimizers follow a three-step procedure to tile an operator:
A tiling strategy defines the best set of dimensions along which to tile the input and output tensors. This depends on three parameters:
Given these parameters, the tiling strategy is determined by a preference order on tiling dimensions. The supported tiling strategies are described by a Backend-specific TilingDims enum (e.g. smaug::smv::TilingDims). For example, the preference order of the SMV convolution operator (whose accelerator is based on the NVDLA convolution engine) looks like this:

1. DimN (tile along the batch dimension N only)
2. DimNC (tile along the batch and channel dimensions)
3. DimNH (tile along the batch and row dimensions)

For the SMV backend, this preference order for a single tensor is implemented at smaug::smv::TilingOptimizerBase::findBestTilingDims. Individual operators call this function on a per-input-tensor basis, then combine the recommendations to pick either a more globally optimal strategy or one that is simpler to reason about. As an example, see the SMV convolution tiling logic at smaug::smv::conv::determineBestTilingDims.
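To make the preference order concrete, here is a minimal sketch of a per-tensor strategy chooser in the spirit of findBestTilingDims. All names, the Shape struct, and the maxTileSize parameter are illustrative assumptions, not SMAUG's actual signatures: the idea is simply to walk the preference list and return the first strategy whose smallest tile fits in the scratchpad.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative subset of a backend's TilingDims enum.
enum class TilingDims { None, DimN, DimNC, DimNH, Invalid };

// Hypothetical NHWC shape; sizes are element counts.
struct Shape {
    size_t n, h, w, c;
    size_t size() const { return n * h * w * c; }
};

// Walk the preference order: None > DimN > DimNC > DimNH.
TilingDims findBestTilingDims(const Shape& s, size_t maxTileSize) {
    if (s.size() <= maxTileSize)
        return TilingDims::None;       // Whole tensor fits: no tiling needed.
    Shape oneBatch{1, s.h, s.w, s.c};
    if (oneBatch.size() <= maxTileSize)
        return TilingDims::DimN;       // Tiling along batch alone suffices.
    Shape oneBatchOneChannel{1, s.h, s.w, 1};
    if (oneBatchOneChannel.size() <= maxTileSize)
        return TilingDims::DimNC;      // Also tile channels; rows stay intact.
    Shape oneRow{1, 1, s.w, s.c};
    if (oneRow.size() <= maxTileSize)
        return TilingDims::DimNH;      // Last resort: tile along rows too.
    return TilingDims::Invalid;        // Even a single row is too large.
}
```

The real implementation also respects minimum tile shapes and per-tensor constraints; this sketch only captures the "first strategy that fits" structure of the search.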
The order of tiling strategy preference depends on the accelerator's dataflow and the amount of work required to shuffle and reshape the data into the smaller tiles. Consider a simple example: if the dataflow reduces partial products along the channel dimension, then we generally want to maximize this dimension so that more values can be reduced in place. This also has benefits for tiling: if the innermost dimension of a tensor is the channel dimension, then tiling strategies which leave this dimension intact will generally be faster than those which break it up, because leaving it intact results in fewer, larger contiguous memcpy operations.
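The memcpy argument can be made concrete with a small counting sketch (not SMAUG code; the function name is made up for illustration). In an NHWC layout, all channels of a pixel, and all pixels of a row, are contiguous in memory, so keeping the channel dimension intact means one large copy per tile row, while splitting it means one small copy per pixel:

```cpp
#include <cassert>
#include <cstddef>

// Rough count of contiguous memcpy operations needed to extract one tile
// from an NHWC-layout tensor. If the channel dimension is left intact,
// each tile row (tileW * fullC elements) is one contiguous run; if it is
// split, each pixel contributes its own run of only tileC elements.
size_t numContiguousCopies(size_t tileH, size_t tileW,
                           size_t tileC, size_t fullC) {
    if (tileC == fullC)
        return tileH;            // one big copy per tile row
    return tileH * tileW;        // one small copy per pixel
}
```

For an 8x8 tile of a 32-channel tensor, keeping channels intact needs 8 copies; splitting channels in half needs 64 copies of one quarter the size each, which is exactly the overhead the preference order tries to avoid.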
Once we have determined a tiling strategy, the next step is to find the best tile shape for this strategy. In this context, "best" simply means "the size that maximizes use of local scratchpad space", while respecting minimum and maximum tile size constraints. This is done by first enumerating all the possible tile shapes for all inputs and outputs to form a list of tiling configurations (which describe how all input and output tensors will be tiled), then picking the tiling configuration which maximizes the use of the local scratchpad space.
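The "enumerate, then maximize scratchpad use" search can be sketched as follows. This is an assumption-laden simplification, not SMAUG's actual API: it enumerates candidate channel tile sizes for a DimNC-style strategy and keeps the configuration with the largest footprint that still fits in the scratchpad.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tiling configuration for a DimNC-style strategy: how many
// channels go into each tile, and how many bytes that tile occupies.
struct TilingConfig {
    size_t tileC = 0;
    size_t bytes = 0;
};

// Enumerate all channel tile sizes and pick the one that maximizes
// scratchpad occupancy without exceeding its capacity.
TilingConfig pickBestConfig(size_t h, size_t w, size_t fullC,
                            size_t elemBytes, size_t spadBytes) {
    TilingConfig best;
    for (size_t c = 1; c <= fullC; ++c) {
        size_t bytes = h * w * c * elemBytes;  // footprint of one input tile
        if (bytes <= spadBytes && bytes > best.bytes)
            best = {c, bytes};
    }
    return best;
}
```

The real optimizer enumerates configurations jointly over all inputs and outputs (and over more dimensions than just channels), but the selection criterion is the same maximization.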
For elementwise and unary operators, this is easy to determine, because the only requirement is that all tiles be of the same shape. In fact, for the SMV backend, elementwise/unary operators don't even bother with determining a tiling strategy; they just jump directly to picking the biggest tile shape and emitting a set of tiled tensors (smaug::smv::unary::doTiling).
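For the elementwise/unary case, the whole "optimizer" reduces to a few lines. The sketch below is in the spirit of smaug::smv::unary::doTiling but with made-up names: flatten the tensor and split it into equal chunks of the largest size that fits in the scratchpad, with a possibly smaller final chunk.

```cpp
#include <cassert>
#include <cstddef>

// Result of tiling a flattened tensor into uniform chunks.
struct FlatTiling {
    size_t tileElems;      // elements per full tile
    size_t numTiles;       // total number of tiles
    size_t lastTileElems;  // the last tile may be smaller
};

FlatTiling tileElementwise(size_t totalElems, size_t spadElems) {
    // Biggest tile that fits in the scratchpad.
    size_t tile = totalElems < spadElems ? totalElems : spadElems;
    size_t num = (totalElems + tile - 1) / tile;  // ceiling division
    size_t last = totalElems - (num - 1) * tile;
    return {tile, num, last};
}
```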
For operators whose input tensors impose constraints on one another, this can become more complex. For example, the field size of a convolution operator's weights limits the smallest practical size of a corresponding input tile. In these cases, we proceed tensor by tensor, strategy by strategy; see smaug::smv::conv::computeBasicTileShapes (the DimN case, for example). The result is a tile shape for all input and output tensors for this operator.
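The weight-field constraint mentioned above can be captured in one small function (a sketch of the reasoning, not SMAUG code). With rowwise tiling, an input tile of tileRows rows yields (tileRows - fieldSize) / stride + 1 output rows, so any tile with fewer rows than the weight's field size produces nothing and is not worth dispatching:

```cpp
#include <cassert>
#include <cstddef>

// Number of convolution output rows computable from an input tile of
// `tileRows` rows, given the weight field size and row stride. A tile
// smaller than the field size cannot cover even one sliding window.
size_t numOutputRows(size_t tileRows, size_t fieldSize, size_t stride) {
    if (tileRows < fieldSize)
        return 0;
    return (tileRows - fieldSize) / stride + 1;
}
```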
With a tiling strategy and a tile shape in hand, we are ready to actually break the original tensor up into its tiles. There are already several functions to do this for a variety of use cases, defined in tensor_utils.h.
All of these functions take a final optional copyData bool parameter which, if true, copies the appropriate window of data from the source tensor into each tile. This is typically true for input tensors and false for output tensors (since any data we write into them during tiling will get overwritten later by the operator itself).
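The copyData convention can be illustrated with a deliberately simplified 1-D sketch; the real helpers in tensor_utils.h operate on full Tensor objects, and this function name and signature are invented for illustration only:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Split a flat buffer into tiles of at most tileSize elements. Data is
// copied into the tiles only when copyData is true (the usual choice for
// inputs); otherwise the tiles are left zeroed, as is typical for outputs
// whose contents the operator will overwrite anyway.
std::vector<std::vector<float>> generateTiles(const float* src, size_t total,
                                              size_t tileSize, bool copyData) {
    std::vector<std::vector<float>> tiles;
    for (size_t start = 0; start < total; start += tileSize) {
        size_t n = (start + tileSize <= total) ? tileSize : total - start;
        std::vector<float> tile(n, 0.0f);
        if (copyData)
            std::memcpy(tile.data(), src + start, n * sizeof(float));
        tiles.push_back(std::move(tile));
    }
    return tiles;
}
```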
These functions are sufficient for the majority of cases. But sometimes even more custom handling is required for some operators, like smaug::smv::conv::TilingOptimizer::generateRowwiseOutputTiledTensor.
The final step is to write a loop nest that invokes the target accelerator using the tiled tensors. This is typically implemented as a runX / runNHWC / etc. function. Depending on the operator, this can be as straightforward as a single for loop, or it can be considerably more complex if tile reuse can be exploited to gain performance. Examples:
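The structure of such a dispatch loop with simple tile reuse can be sketched as follows. This is not a SMAUG function: loadWeights/invokeAccel are stand-ins (here reduced to counters) for the real data movement and accelerator invocation. The point is that hoisting the weight load out of the inner loop means each weight tile is loaded once and reused across every input tile:

```cpp
#include <cassert>
#include <cstddef>

// Counters standing in for the side effects of a runX-style loop nest.
struct Trace {
    size_t weightLoads = 0;
    size_t invocations = 0;
};

// Outer loop over weight tiles, inner loop over input tiles: weights are
// loaded once per outer iteration and reused for all inner iterations.
Trace runTiled(size_t numWeightTiles, size_t numInputTiles) {
    Trace t;
    for (size_t w = 0; w < numWeightTiles; ++w) {
        ++t.weightLoads;                      // loadWeights(weightTiles[w]);
        for (size_t i = 0; i < numInputTiles; ++i)
            ++t.invocations;                  // invokeAccel(inputTiles[i], ...);
    }
    return t;
}
```

With 3 weight tiles and 4 input tiles, the accelerator runs 12 times but the weights are loaded only 3 times; a naive loop that reloaded weights per invocation would perform 12 loads.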