SMAUG
Simulating Machine Learning Applications on gem5-Aladdin
In this document, we will describe how to build a custom operator with a custom hardware accelerator model implementing the logic. Our custom operator will perform an element-wise add of two tensors.
A backend is a way to logically group together a set of related operators and/or enforce shared properties on instantiations of operators. For example, a backend may require that operators share a common set of compute resources/global variables, impose the same zero-padding requirements on data, and more. SMAUG ships with two backends: Reference and SMV.
In SMAUG, a Backend is represented as a class composed purely of static functions and variables. They are defined in core/backend.h and core/backend.cpp. Backend classes are used as template parameters to Operator subclasses, so they must be statically interchangeable. Thus, all backend definitions must statically define the same set of functions and variables, which means that they must also support every operator type.
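As a rough sketch (member names here are illustrative; the real definitions live in core/backend.h):

```cpp
// Every backend exposes the same static interface, so any backend can be
// dropped in as a template parameter to an Operator subclass.
class ReferenceBackend {
   public:
    static const int Alignment = 0;  // shared data-alignment requirement
    static std::string name() { return "Reference"; }
    // One static factory per operator type; these are what the
    // DECL_CREATE_OP/DEF_CREATE_OP macros (discussed below) generate.
    static ConvolutionOp<ReferenceBackend>* createConvolutionOp(
            const std::string& name, Workspace* workspace);
    // ... and likewise for every other operator type.
};
```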
After building your custom Operator, you will need to include and register the new operator in those files. We will discuss this more once we get to that step.
When SMAUG reads the model topology proto, it creates named Operator objects and places them in a global Workspace. Any Tensor or Operator can be looked up by name in the workspace. By convention, SMAUG first creates an empty Operator of the appropriate type with a common constructor signature, then uses type-specific setters to fill in all the parameters. After all operators are constructed, SMAUG automatically adds edges in the graph to link dependent operators together. For example, here is a typical operator construction pattern (see network_builder.cpp for more examples):
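A sketch of that pattern for the 4D convolution discussed next (setter names are illustrative; see network_builder.cpp for the real ones):

```cpp
// Create an empty operator with the common (name, workspace) constructor
// signature, then fill in parameters with type-specific setters.
ConvolutionOp<ReferenceBackend>* op =
        ReferenceBackend::createConvolutionOp("conv0", workspace);
op->setWeightDims(1, 2, 3);  // 1x2x3 filters...
op->setNumOfmaps(4);         // ...producing 4 output feature maps.
// Note: no input/output tensors are created or attached here.
```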
Note that operator constructors are invoked by a Backend::createXXXOperator function (created when registering a new operator in the backend). Every Operator's constructor must accept the same two arguments (name and workspace) and must invoke the parent class's constructor.
More importantly, note that at construction time, we do not set or create tensors as parameters to operators. Instead, we set the dimensions of tensors and create them at a later time. Here, we provided a setter for the dimensions of a 4D convolution's weights: filter size (1x2x3) and number of output feature maps (4). But we do not set the dimensions of the input or output activation tensors. The dimensions of the input tensor depend on the previous operator in the graph, and the dimensions of the output in turn depend on the input. At operator construction time, these relationships are not yet known.
Once all operators are constructed, how does SMAUG connect an output tensor of operator A to the input tensor of operator B? What happens if operator B has many input tensors, each of which has a different meaning? The answer is that the base Operator class contains an ordered list of inputs and outputs. Each operator implementation publishes the number of inputs and outputs it has, along with the meaning of each one (e.g. input tensor 0 represents activations and input tensor 1 represents weights). This ordering is reflected in the Python API and encoded in the model topology proto. SMAUG uses this information to link operators together with the Operator::setInput and Operator::setOutput APIs. This information is typically encoded as enums:
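For example, the elementwise-add operator we build below publishes its two inputs and one output like so:

```cpp
// Indices into this operator's ordered input and output lists.
enum { kInput0, kInput1, kNumInputs };
enum { kOutput, kNumOutputs };
```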
Putting this all together, below is a simple example of a custom Operator that has no backend-specific behavior. Place this code into smaug/operators/my_custom_operator.h.
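A sketch of that file (OpType::MyCustom is the enum value we add to types.proto in the integration steps below):

```cpp
#include "smaug/core/backend.h"
#include "smaug/core/operator.h"
#include "smaug/core/workspace.h"

namespace smaug {

// An elementwise-add operator with no backend-specific behavior.
template <typename Backend>
class MyCustomOperator : public Operator {
   public:
    MyCustomOperator(const std::string& name, Workspace* workspace)
            : Operator(name, OpType::MyCustom, workspace) {
        // Reserve slots for our inputs and outputs up front.
        inputs.resize(kNumInputs, nullptr);
        outputs.resize(kNumOutputs, nullptr);
    }

    // The operator's logic; we will fill this in shortly.
    void run() override {}

    // Published meaning of each input and output index.
    enum { kInput0, kInput1, kNumInputs };
    enum { kOutput, kNumOutputs };
};

}  // namespace smaug
```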
Now we can integrate this custom operator into SMAUG. To do so, we need to make a few more modifications:
1. Add a new OpType enum value for this operator to smaug/core/types.proto.
2. Add DECL_CREATE_OP(MyCustomOperator) to all backends in backend.h.
3. Add DEF_CREATE_OP(MyCustomOperator, Backend) for all backends in backend.cpp.
4. Register the new operator in createAndAddOperator in network_builder.cpp (see the sketch after this list).
5. Add any new .cpp files to the SRCS variable in smaug/make/Makefile.common.

In order to use your new operator in a model, you also need to add an API to create it in the Python API. See the Python documentation for details.
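For step 4, the new case might look like the following sketch (the factory name follows the convention that DECL_CREATE_OP generates; treat the details as illustrative):

```cpp
// In createAndAddOperator in network_builder.cpp:
if (type == OpType::MyCustom) {
    auto op = Backend::createMyCustomOperator(name, workspace);
    // Set any operator-specific parameters here.
    network->addOperator(op);
}
```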
We've written the skeleton of a new custom operator, but it currently doesn't do anything. Our custom operator is supposed to take two tensors and add them elementwise. In this section, we'll learn how to implement this. We'll first write and test a CPU-only implementation (no interaction with Aladdin) to familiarize ourselves with SMAUG APIs. Afterwards, we'll modify this to work with the gem5-Aladdin family of tools.
The first step of implementing the actual operator is to create the tensors to store the output. In practice, the Python API will compute shapes for all Tensors, and the network builder will handle creation and population of Tensor objects into each Operator. However, for testing purposes, we also implement a createAllTensors virtual function to do this all in a single step. For an elementwise add, the output tensor's shape is the same as the inputs'.
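A sketch (assuming the getInput accessor and the workspace member keep their usual SMAUG spellings):

```cpp
// In my_custom_operator.h. The output shape matches the inputs, so we can
// create the output tensor directly from input 0's shape.
void createAllTensors() override {
    Tensor* output = new Tensor(name, getInput(kInput0)->getShape());
    outputs.at(kOutput) = output;
    workspace->addTensor(output);
}
```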
We should also verify that the inputs to our operator match our expectations. There are several common properties to validate:

- the shapes of the input tensors;
- the data type;
- the data layout.

In our example, an elementwise addition requires that the two input tensors have the same shape and that the data type be single-precision float, but it supports all data layouts: it doesn't matter whether the data is stored as NCHW/NHWC/NC, because the operation is elementwise.
This validation is provided by a validate API which runs after the network is fully constructed:
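A sketch of that check for our operator (treat the DataType spelling as an assumption; the enum comes from types.proto):

```cpp
// In my_custom_operator.h: inputs must match in shape and be fp32.
// Any data layout is acceptable for an elementwise operation.
bool validate() override {
    Tensor* input0 = getInput(kInput0);
    Tensor* input1 = getInput(kInput1);
    return input0->getShape() == input1->getShape() &&
           input0->getDataType() == DataType::Float32 &&
           input1->getDataType() == DataType::Float32;
}
```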
Now, we can write the run function, which implements the operator's logic itself.
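A sketch of both pieces follows. The kernel is deliberately a plain C-style function over raw pointers, which will pay off later when we instrument it with LLVM-Tracer; data<float>() is the usual SMAUG accessor for a tensor's storage (verify exact names in tensor.h):

```cpp
// The kernel: a standalone function containing only the math.
void elementwise_add(float* input0, float* input1, float* output, int size) {
    for (int i = 0; i < size; i++)
        output[i] = input0[i] + input1[i];
}

// In my_custom_operator.h:
void run() override {
    Tensor* input0 = getInput(kInput0);
    Tensor* input1 = getInput(kInput1);
    Tensor* output = getOutput(kOutput);
    elementwise_add(input0->data<float>(),
                    input1->data<float>(),
                    output->data<float>(),
                    output->getShape().size());
}
```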
With the implementation complete, let's try it out with a unit test. SMAUG uses the Catch2 framework for unit testing, and the SmaugTest fixture provides a range of useful testing utilities. Open up a new cpp file (my_custom_operator_test.cpp):
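A sketch of such a test (helpers like allocateAllTensors and verifyOutputs come from the SmaugTest fixture; check smaug_test.h for their exact signatures):

```cpp
#include "catch.hpp"
#include "smaug/core/backend.h"
#include "smaug/core/smaug_test.h"
#include "smaug/core/tensor.h"
#include "smaug/operators/my_custom_operator.h"

using namespace smaug;

TEST_CASE_METHOD(SmaugTest, "my custom operator", "[ops]") {
    // NC is a simple 2D layout: N batches by C columns of data.
    TensorShape shape({ 1, 10 }, DataLayout::NC);
    Tensor* input0 = new Tensor("input0", shape);
    input0->allocateStorage<float>();
    input0->fillData<float>({ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 });
    workspace()->addTensor(input0);

    Tensor* input1 = new Tensor("input1", shape);
    input1->allocateStorage<float>();
    input1->fillData<float>({ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 });
    workspace()->addTensor(input1);

    // Create the operator, attach inputs, and create/allocate the output.
    using TestOp = MyCustomOperator<ReferenceBackend>;
    auto op = new TestOp("eltwise_add", workspace());
    op->setInput(input0, TestOp::kInput0);
    op->setInput(input1, TestOp::kInput1);
    op->createAllTensors();
    allocateAllTensors<float>(op);

    op->run();

    // verifyOutputs performs an approximate elementwise comparison.
    std::vector<float> expected = { 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 };
    verifyOutputs(op->getOutput(TestOp::kOutput), expected);
}
```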
Add your new test to the TESTS variable in make/Make.common. Then build the unit tests with make tests and run ./smaug/operators/my_custom_operator_test.
Now that our software-only implementation is working, we are going to adapt it to serve as an Aladdin model for a hardware accelerator. This is a multi-step process:

1. Add accelerator-local scratchpads to the backend.
2. Tile the input and output tensors into pieces that fit in those scratchpads.
3. Run the kernel over each tile and flatten the results.
4. Use DMA to move data between host memory and the scratchpads, and map the host arrays to the accelerator's TLB.
5. Wrap the kernel invocation with invokeKernel and give the kernel C linkage for LLVM-Tracer.

The sections below walk through each step.
Accelerators have a limited amount of local memory that can be directly accessed. When offloading work to an accelerator, any data it needs must be copied into this local memory first before computation can begin. If the input data is larger than this local memory, that data must be tiled into smaller pieces that can fit. How much memory to allocate is a design tradeoff between performance and hardware cost. This is a tradeoff that SMAUG can help researchers investigate for a particular workload.
A piece of accelerator-local memory is represented in SMAUG as a global array that is only accessed within the accelerated function. These are typically declared under the appropriate namespaces in backend.h and defined in backend.cpp. For example, the SMV backend has three scratchpads (spad0, spad1, and spad2). spad0 and spad1 are typically used for inputs, while spad2 is used for outputs, but this is merely a convention.
For our custom operator, let's add two scratchpads to the Reference backend. Open up backend.cpp and add two array definitions as shown below. Also, add a unique integer ID for this custom operator. We'll use it later when invoking the accelerator in simulation.
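A sketch of those definitions (the ref namespace matches the Reference backend's convention; the specific ID value just needs to be unique):

```cpp
// In backend.cpp:
namespace ref {
// Unique ID used to identify this accelerator model in simulation.
const unsigned kMyCustomOperatorHw = 0x00000005;
// Accelerator-local scratchpads, accessed only inside the kernel.
float* spad0;
float* spad1;
}  // namespace ref
```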
Then, open up backend.h and add extern declarations for them:
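Mirroring the definitions above (again a sketch):

```cpp
// In backend.h:
namespace ref {
extern const unsigned kMyCustomOperatorHw;
extern float* spad0;
extern float* spad1;
}  // namespace ref
```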
We need to allocate memory for these new arrays before they can be used. By this point, you should have picked a scratchpad size; let's say you picked 32 KB. In backend.h, modify ReferenceBackend like so:
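A sketch of that change (malloc_aligned is assumed to be SMAUG's aligned-allocation helper; verify the exact utility in the codebase):

```cpp
class ReferenceBackend {
   public:
    // 32 KB per scratchpad, per the choice above.
    static int SpadSize() { return 32 * 1024; }
    // Allocate the scratchpads once at startup, free them at shutdown.
    static void initGlobals() {
        ref::spad0 = (float*)malloc_aligned(SpadSize());
        ref::spad1 = (float*)malloc_aligned(SpadSize());
    }
    static void freeGlobals() {
        free(ref::spad0);
        free(ref::spad1);
    }
    // ... existing static members ...
};
```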
Your accelerator-local scratchpads have now been correctly configured. We will modify our kernel function to use these shortly.
Writing logic to correctly tile your tensors can be tricky. There are lots of corner cases and special features of the data that must be taken into account so that the accelerator can still compute the correct output with partial data. To ease this, SMAUG provides a library of useful tiling functionality that you can reuse. For more information on the SMAUG tiling optimizer design, refer to Tiling optimizers in SMAUG.
Fortunately, tiling a tensor for an elementwise operator is the easiest kind of tiling there is: no need to consider data reuse, striding, etc. We simply need to break up the two input tensors and the output tensor into equal-sized pieces that maximize the use of the local scratchpads. This can be accomplished with just a few lines of code in my_custom_operator.h by overriding the tile() function and calling the smaug::generateTiledTensorPerBatchNC function. The result will be an array of TiledTensor objects. A TiledTensor represents a spatially ordered collection of Tensor objects, each containing a copy of a slice of data from an underlying Tensor.
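A sketch of the override (see tensor_utils.h for generateTiledTensorPerBatchNC's full signature; the real helper takes a few more options, such as whether to copy data into the tiles):

```cpp
// In my_custom_operator.h:
void tile() override {
    Tensor* input0 = getInput(kInput0);
    Tensor* input1 = getInput(kInput1);
    Tensor* output = getOutput(kOutput);
    // Each tile must fit in a scratchpad, so cap the tile size at
    // SpadSize() divided by the size of one element.
    int maxTileSize =
            std::min(Backend::SpadSize() / input0->getDataTypeSize(),
                     input0->getShape().storageSize());
    TensorShape tileShape({ 1, maxTileSize }, DataLayout::NC);
    tiledTensors[0] = generateTiledTensorPerBatchNC(input0, tileShape, this);
    tiledTensors[1] = generateTiledTensorPerBatchNC(input1, tileShape, this);
    tiledTensors[2] = generateTiledTensorPerBatchNC(output, tileShape, this);
}

// Tiled views of the two inputs and the output.
std::array<TiledTensor, 3> tiledTensors;
```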
This part is straightforward: iterate over each of the tensor tiles and run the elementwise_add function on their contents, then flatten the tiled output tensor back into a single one. Since this is an elementwise operation, there is no need to consider data reuse, so a simple for loop will do. For more complex operators in which data reuse is a critical factor to optimize for, changing the order of iteration along the dimensions may greatly affect performance.
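A sketch of the updated run() (getTileWithData and flattenTiledTensor are from SMAUG's tensor utilities; verify exact names there):

```cpp
// In my_custom_operator.h:
void run() override {
    TiledTensor& input0 = tiledTensors[0];
    TiledTensor& input1 = tiledTensors[1];
    TiledTensor& output = tiledTensors[2];
    for (int i = 0; i < input0.size(); i++) {
        // getTileWithData copies the tile's slice of the underlying
        // tensor into the tile if it hasn't been copied already.
        Tensor* input0Tile = input0.getTileWithData(i);
        Tensor* input1Tile = input1.getTileWithData(i);
        Tensor* outputTile = output.getTileWithData(i);
        elementwise_add(input0Tile->data<float>(),
                        input1Tile->data<float>(),
                        outputTile->data<float>(),
                        outputTile->getShape().size());
    }
    // Copy the output tiles back into one contiguous tensor.
    flattenTiledTensor(output, getOutput(kOutput));
}
```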
As usual, let's add a unit test to verify that this implementation is correct. We need to ensure that our inputs are larger than the scratchpads we created in order for tiling to make any meaningful change. Add this code to your existing unit test.
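A sketch of the addition (with a 32 KB scratchpad holding 8K floats, a 4x16384 input cannot fit in a single tile, so the operator must tile; helper names are as in the earlier test):

```cpp
TEST_CASE_METHOD(SmaugTest, "tiled custom operator", "[ops]") {
    // 4 batches x 16384 floats = 256 KB per input: far larger than 32 KB.
    TensorShape shape({ 4, 16384 }, DataLayout::NC);
    std::vector<float> data0(shape.size()), data1(shape.size());
    std::vector<float> expected(shape.size());
    for (int i = 0; i < shape.size(); i++) {
        data0[i] = i;
        data1[i] = 2 * i;
        expected[i] = 3 * i;  // the expected elementwise sum
    }
    Tensor* input0 = new Tensor("input0", shape);
    input0->allocateStorage<float>();
    input0->fillData(data0.data(), data0.size());
    workspace()->addTensor(input0);
    Tensor* input1 = new Tensor("input1", shape);
    input1->allocateStorage<float>();
    input1->fillData(data1.data(), data1.size());
    workspace()->addTensor(input1);

    using TestOp = MyCustomOperator<ReferenceBackend>;
    auto op = new TestOp("tiled_eltwise_add", workspace());
    op->setInput(input0, TestOp::kInput0);
    op->setInput(input1, TestOp::kInput1);
    op->createAllTensors();
    allocateAllTensors<float>(op);

    op->tile();  // Split the tensors into scratchpad-sized tiles.
    op->run();
    verifyOutputs(op->getOutput(TestOp::kOutput), expected);
}
```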
From the perspective of the elementwise_add function, we've been writing our code as if it could directly access the contents of the input0, input1, and output arrays. These are pointers into host memory. But if this function represents a block of hardware, and the hardware can only access its local scratchpads, we need to copy the contents of host memory into the scratchpads first. In gem5-Aladdin, this is accomplished with two special functions: dmaLoad and dmaStore. They work just like regular calls to memcpy, except that LLVM-Tracer recognizes them as special function calls and has Aladdin handle them appropriately. This is handled as follows:
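A sketch of the updated kernel (dmaLoad and dmaStore take the destination first, like memcpy; see smaug/operators/common.h):

```cpp
#include "smaug/operators/common.h"

// The kernel now takes both host pointers and scratchpad pointers.
void elementwise_add(float* host_input0,
                     float* host_input1,
                     float* host_output,
                     float* spad0,
                     float* spad1,
                     int size) {
    // Copy the inputs from host memory into the local scratchpads.
    dmaLoad(spad0, host_input0, size * sizeof(float));
    dmaLoad(spad1, host_input1, size * sizeof(float));
    // Accumulate spad0 into spad1 so the result ends up in a scratchpad.
    for (int i = 0; i < size; i++)
        spad1[i] += spad0[i];
    // Copy the result from the scratchpad back out to host memory.
    dmaStore(host_output, spad1, size * sizeof(float));
}
```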
Next, for simulation, we need to set up a TLB page mapping for all host memory accessed by the accelerator, so that the dmaLoad/dmaStore functions can load the correct data into the scratchpads. This is done with the mapArrayToAccelerator API.
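A sketch of the mapping calls inside run(), one per host array the kernel touches (I've used the spelling mapArrayToAccel from smaug/operators/common.h; the string names must match the kernel's parameter names, and input0Data and friends stand in for the raw tile pointers from run()):

```cpp
// Map each tile's host memory into the accelerator's TLB before the
// kernel runs. ref::kMyCustomOperatorHw is the ID we defined earlier.
mapArrayToAccel(ref::kMyCustomOperatorHw, "host_input0", input0Data,
                size * sizeof(float));
mapArrayToAccel(ref::kMyCustomOperatorHw, "host_input1", input1Data,
                size * sizeof(float));
mapArrayToAccel(ref::kMyCustomOperatorHw, "host_output", outputData,
                size * sizeof(float));
```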
Finally, we need to update the call to elementwise_add and wrap it with the invokeKernel function, so that in simulation, gem5-Aladdin will know to fire up the hardware accelerator model instead of just running the kernel function on the CPU.
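A sketch of the wrapped call (see common.h for invokeKernel's exact signature):

```cpp
// Instead of calling elementwise_add directly, dispatch it through
// invokeKernel. In simulation this fires up the accelerator model
// identified by ref::kMyCustomOperatorHw; in native execution it
// simply calls the function on the CPU.
invokeKernel(ref::kMyCustomOperatorHw, elementwise_add,
             input0Data, input1Data, outputData,
             ref::spad0, ref::spad1, size);
```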
LLVM-Tracer is the instrumentation tool that we will use to generate a dynamic execution trace for Aladdin to consume. It is only compatible with instrumenting C code, not C++, but it can ignore any code with C++ linkage that it sees. This allows us to write in C++ for the vast majority of our code, dropping down into C only for the few kernel functions representing the hardware accelerators.
Adapting our code for use with LLVM-Tracer is simple: just surround the kernel function in an extern "C" block.
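For our kernel, that looks like:

```cpp
// Give the kernel C linkage so LLVM-Tracer can instrument it.
extern "C" {

void elementwise_add(float* host_input0, float* host_input1,
                     float* host_output, float* spad0, float* spad1,
                     int size) {
    // ... kernel body unchanged from above ...
}

}  // extern "C"
```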
Build your code, make sure it all compiles, and voila! We are ready to generate our trace, set up the simulator configuration files, and start simulating our first operator!