SMAUG
Simulating Machine Learning Applications on gem5-Aladdin
Building a custom operator

In this document, we will describe how to build a custom operator with a custom hardware accelerator model implementing the logic. Our custom operator will perform an element-wise add of two tensors.

SMAUG backends

A backend is a way to logically group together a set of related operators and/or enforce shared properties on instantiations of operators. For example, a backend may logically require that operators share a common set of compute resources/global variables, impose the same zero-padding requirements on data, and more. SMAUG ships with two backends:

  • Reference: reference implementations of all operators supported in SMAUG. These are intended to be correct, not fast.
  • SMV: operator implementations based on the SMV chip taped out by the Harvard Architecture, Circuits, and Compilers research group in 2018. These are models of accelerators with 8-wide 16-bit vectorized datapaths. The SIMD datapaths require data to be properly aligned before use.

In SMAUG, a Backend is represented as a class comprised purely of static functions and variables. They are defined in core/backend.h and core/backend.cpp. Backend classes are used as template parameters to Operator subclasses, so they must be statically interchangeable. Thus, all backend definitions must statically define the same set of functions and variables, which means that they must also support every operator type.
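To make this convention concrete, here is a rough sketch of what a backend class shaped this way looks like. The names below (MyBackend, mybackend::kSpadSize, the specific factory functions) are illustrative only; see core/backend.h for the real definitions.

// Sketch only: illustrates the "all static" backend convention described
// above. Member names are illustrative, not the actual SMAUG definitions.
class MyBackend {
   public:
    // Shared properties that operator implementations can rely on, such as
    // the data alignment requirement (in elements) and the scratchpad size.
    static const int Alignment = 8;
    static int SpadSize() { return mybackend::kSpadSize; }

    // One static factory function per operator type. Because backends are
    // used as template parameters, every backend must declare the same set.
    static ConvolutionOp<MyBackend>* createConvolutionOp(
            const std::string& name, Workspace* workspace);
    static EltwiseAddOp<MyBackend>* createEltwiseAddOp(
            const std::string& name, Workspace* workspace);
    // ... and so on for every other operator type.
};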

After building your custom Operator, you will need to include and register the new operator in those files. We will discuss this more once we get to that step.

The Operator class

When SMAUG reads the model topology proto, it creates named Operator objects and places them in a global Workspace. Any Tensor or Operator can be looked up by name in the workspace. By convention, SMAUG first creates an empty Operator of the appropriate type with a common constructor signature, then uses type-specific setters to fill in all the parameters. After all operators are constructed, SMAUG automatically adds edges in the graph to link dependent operators together. For example, here is a typical operator construction pattern (see network_builder.cpp for more examples):

ConvolutionOp<Backend>* op = Backend::createConvolutionOp(name, workspace);
op->setWeightDims(1,2,3,4);
op->setPadding(smaug::PaddingType::SAME);
// Set the remaining operator parameters...
network->addOperator(op);

Note that operator constructors are invoked by a Backend::createXXXOperator function (created when registering a new operator in the backend). Every Operator's constructor must accept the same two arguments: name and workspace, and it must invoke the parent class's constructor.

More importantly, note that at construction time, we do not set or create tensors as parameters to operators. Instead, we set the dimensions of tensors and create them at a later time. Here, we provided a setter for the dimensions of a 4D convolution's weights - filter size (1x2x3) and number of output feature maps (4). But we do not set the dimensions for the input or output activation tensors. The dimensions of the input tensor depend on the previous operator in the graph, and the dimensions of the output in turn depend on the input. At operator construction time, these relationships are not yet known.

Once all operators are constructed, how does SMAUG connect an output tensor of operator A to the input tensor of operator B? What happens if operator B has many input tensors, each of which has a different meaning? The answer is that the base Operator class contains an ordered list of inputs and outputs. Each operator implementation publishes the number of inputs and outputs it has, along with the meaning of each one (e.g. input tensor 0 represents activations and input tensor 1 represents weights). This ordering is reflected in the Python API and encoded in the model topology proto. SMAUG uses this information to link operators together with the Operator::setInput and Operator::setOutput APIs. This information is typically encoded as enums:

enum {kInput0, kInput1, kNumInputs};
enum {kOutput, kNumOutputs};

Putting this all together, below is a simple example of a custom Operator that has no backend-specific behavior. Place this code into smaug/operators/my_custom_operator.h.

#include "core/operator.h"
#include "core/workspace.h"
namespace smaug {
template <typename Backend>
class MyCustomOperator : public Operator {
public:
MyCustomOperator(const std::string& name, Workspace* workspace) :
Operator(name, workspace) {
inputs.resize(kNumInputs, nullptr);
outputs.resize(kNumOutputs, nullptr);
}
void setParam1(int val) { param1 = val; }
void setParam2(int val) { param2 = val; }
// A required function that implements the actual Operator logic. Leave this
// blank for now.
void run() override {}
// Optional override for testing purposes.
void createAllTensors() override {}
// Optional but recommended function to verify operator parameters.
bool validate() override {}
// An optional function to tile the input tensors.
void tile() override {}
enum {kInput0, kInput1, kNumInputs};
enum {kOutput, kNumOutputs};
private:
int param1 = 0;
int param2 = 0;
};
} // namespace smaug

Now we can integrate this custom operator into SMAUG. To do so, we need to make a few more modifications:

  1. Add a new OpType enum for this operator to smaug/core/types.proto.
  2. Define the operator in all backends. Simply follow the existing convention in backend.h and backend.cpp (a sketch of these changes follows this list):
    • Include the header file and forward declare the operator in backend.h.
    • Add DECL_CREATE_OP(MyCustomOperator) to all backends in backend.h.
    • Add DEF_CREATE_OP(MyCustomOperator, Backend) for all backends in backend.cpp.
  3. Update network_builder.cpp to know about the new operator. This belongs in createAndAddOperator:
    if (type == OpType::MyCustomOperator) {
        auto op = Backend::createMyCustomOperator(name, workspace);
        op->setParam1(node.param1());
        op->setParam2(node.param2());
        network->addOperator(op);
    }
  4. Add any new .cpp files to the SRCS variable in smaug/make/Makefile.common.

In order to use your new operator in a model, you also need to add an API to create it in the Python API. See the Python documentation for details.

Implementing the operator logic

We've written the skeleton of a new custom operator, but it currently doesn't do anything. Our custom operator is supposed to take two tensors and add them elementwise. In this section, we'll learn how to implement this. We'll first write and test a CPU-only implementation (no interaction with Aladdin) to familiarize ourselves with SMAUG APIs. Afterwards, we'll modify this to work with the gem5-Aladdin family of tools.

Software-only implementation

The first step of implementing the actual operator is to create the tensors to store the output. In practice, the Python API will compute shapes for all Tensors, and the network builder will handle creation and population of Tensor objects into each Operator. However, for testing purposes, we also implement a createAllTensors virtual function to do this all in a single step. For an elementwise add, the output tensor's shape is the same as the inputs.

void createAllTensors() override {
    Tensor* output = new Tensor(name, inputs.at(kInput0)->getShape());
    outputs.at(kOutput) = output;
    workspace->addTensor(output);
}

We should also verify that the inputs to our operator match our expectations. There are several common properties to validate:

  1. Tensor shapes.
  2. Data layout. The operator must support the order in which the tensor's dimensions are arranged (e.g. NCHW vs. NHWC).
  3. Data type. The operator implementation (which represents a hardware model) must have explicit support for whatever data types are desired.

In our example, an elementwise addition requires that the two input tensors have the same shape and that their data type be single-precision float, but it supports all data layouts. It doesn't matter whether the data is stored as NCHW/NHWC/NC, because the operation is elementwise.

This validation is provided by a validate API which runs after the network is fully constructed:

bool validate() override {
    Tensor* input0 = getInput(kInput0);
    Tensor* input1 = getInput(kInput1);
    return (input0->getShape() == input1->getShape() &&
            input0->getDataType() == DataType::Float32 &&
            input1->getDataType() == DataType::Float32);
}

Now, we can write the run function which implements the operator's function itself.

void elementwise_add(float* input0, float* input1, float* output, int size) {
    for (int i = 0; i < size; i++) {
        output[i] = input0[i] + input1[i];
    }
}

void run() override {
    Tensor* input0 = getInput(kInput0);
    Tensor* input1 = getInput(kInput1);
    Tensor* output = getOutput(kOutput);
    // Get handles to the actual underlying data storage. This performs a
    // dynamic_cast to the specified data type, which we verified is safe
    // inside validate().
    float* input0Data = input0->data<float>();
    float* input1Data = input1->data<float>();
    float* outputData = output->data<float>();
    elementwise_add(input0Data, input1Data, outputData,
                    output->getShape().size());
}

Test it out

With the implementation complete, let's try it out with a unit test. SMAUG uses the Catch2 framework for unit testing, and the SmaugTest fixture provides a range of useful testing utilities. Open up a new cpp file (my_custom_operator_test.cpp):

#include "catch.hpp"
#include "smaug/core/backend.h"
#include "smaug/core/tensor.h"
#include "smaug/operators/my_custom_operator.h"
using namespace smaug;
TEST_CASE_METHOD(SmaugTest, MyCustomOperator, "[ops]") {
// DataLayout::NC is a simple 2D layout, where N = batches and C = a column
// of data.
TensorShape shape(1, 10, DataLayout::NC);
Tensor* input0 = new Tensor("tensor0", shape);
// Allocate the memory for a 1x10 array of floats.
input0->allocateStorage<float>();
// Add some testing data.
input0->fillData<float>({1,2,3,4,5,6,7,8,9,10});
workspace()->addTensor(input0);
// Repeat this for a second input tensor.
Tensor* input1 = new Tensor("tensor1", shape);
input1->allocateStorage<float>();
input1->fillData<float>({2,3,4,5,6,7,8,9,10,11});
workspace()->addTensor(input1);
// Create the operator and fill it with our tensors.
using TestOp = MyCustomOperator<ReferenceBackend>;
auto op = new TestOp("eltwise_add", workspace());
op->setInput(input0, TestOp::kInputs0);
op->setInput(input1, TestOp::kInputs1);
op->createAllTensors();
// Allocates memory for all the output tensors created by createAllTensors.
allocateAllTensors(op);
op->run();
// Compare the output of the operator against expected values.
std::vector<float> expected_output = {3,5,7,9,11,13,15,17,19,21};
Tensor* output = op->getOutput(TestOp::kOutput);
// This performs an approximate comparison between the tensor's output and
// the expected values.
verifyOutputs(output, expected_output);
}

Add your new test to the TESTS variable in make/Make.common. Then build the unit tests with make tests and run ./smaug/operators/my_custom_operator_test.

Hardware-accelerated implementation

Now that our software-only implementation is working, we are going to adapt it to serve as an Aladdin model for a hardware accelerator. This is a multi-step process:

  1. Select the design parameters for your accelerator memory system. In particular, how much local scratchpad space will the accelerator have to work with?
  2. Based on the answer to #1, tile your input and output tensors so that they fit within the allocated scratchpad space.
  3. Write a scheduler function to iterate over each of the tiles, copying data into/out of the local accelerator scratchpad space, and running the operator function. This is also commonly known as writing a loop nest.
  4. Use the provided gem5-Aladdin APIs to copy data and invoke the kernel function, so that during simulation, the simulator will invoke the Aladdin accelerator instead of running the function on the CPU.
  5. Adapt the accelerated function (also called the kernel function) so that it can be traced by LLVM-Tracer, the LLVM-based instrumentation tool that generates the Aladdin dynamic trace.

Accelerator design parameters

Accelerators have a limited amount of local memory that can be directly accessed. When offloading work to an accelerator, any data it needs must be copied into this local memory first before computation can begin. If the input data is larger than this local memory, that data must be tiled into smaller pieces that can fit. How much memory to allocate is a design tradeoff between performance and hardware cost. This is a tradeoff that SMAUG can help researchers investigate for a particular workload.

A piece of accelerator-local memory is represented in SMAUG as a global array that is only accessed within the accelerated function. These are typically declared under the appropriate namespaces in backend.h and defined in backend.cpp. For example, the SMV backend has three scratchpads (spad0, spad1, and spad2). spad0 and spad1 are typically used for inputs, while spad2 is used for outputs, but this is merely a convention.

For our custom operator, let's add two scratchpads to the Reference backend. Open up backend.cpp and add two array definitions as shown below. Also, add a unique integer ID for this custom operator. We'll use it later when invoking the accelerator in simulation.

namespace ref {
// This is all existing code...
const unsigned kConvolutionHw = 0x0001;
const unsigned kInnerProductHw = 0x0002;
const unsigned kEltwiseOpHw = 0x0003;
const unsigned kBatchNormHw = 0x0004;
const unsigned kPoolingHw = 0x0005;
// Define our new scratchpads here.
int kSpadSize;
float* spad0;
float* spad1;
// Add a unique ID for our accelerator HW. This will be used to invoke the
// accelerator during simulation.
const unsigned kMyCustomOperatorHw = 0x0006;
} // namespace ref

Then, open up backend.h and add extern declarations for them:

namespace ref {
// This is all existing code...
extern const unsigned kConvolutionHw;
extern const unsigned kInnerProductHw;
extern const unsigned kEltwiseOpHw;
extern const unsigned kBatchNormHw;
extern const unsigned kPoolingHw;
// Declare our two new global arrays and accelerator IDs here.
extern int kSpadSize;
extern float* spad0;
extern float* spad1;
extern const unsigned kMyCustomOperatorHw;
} // namespace ref

We need to allocate memory for these new arrays before they can be used. By this point, you should have picked a scratchpad size. Let's say you picked 32KB. In backend.h, modify ReferenceBackend like so:

class ReferenceBackend {
   public:
    static int SpadSize() { return ref::kSpadSize; }
    static void initGlobals() {
        ref::kSpadSize = 32 * 1024;  // Replace with your actual value.
        ref::spad0 = (float*)malloc_aligned(ref::kSpadSize);
        ref::spad1 = (float*)malloc_aligned(ref::kSpadSize);
    }
    static void freeGlobals() {
        free(ref::spad0);
        free(ref::spad1);
    }
};

Your accelerator-local scratchpads have now been correctly configured. We will modify our kernel function to use these shortly.

Tile your Tensors

Writing logic to correctly tile your tensors can be tricky. There are lots of corner cases and special features of the data that must be taken into account so that the accelerator can still compute the correct output with partial data. To ease this, SMAUG provides a library of useful tiling functionality that you can reuse. For more information on the SMAUG tiling optimizer design, refer to Tiling optimizers in SMAUG.

Fortunately, tiling a tensor for an elementwise operator is the easiest kind of tiling there is - no need to consider data reuse, striding, etc. We simply need to break up the two input tensors and output tensor into equal sized pieces which maximize the use of the local scratchpads. This can be accomplished with just a few lines of code in my_custom_operator.h by overriding the tile() function and calling the smaug::generateTiledTensorPerBatchNC function. The result will be an array of TiledTensor objects. A TiledTensor represents a spatially ordered collection of Tensor objects, each containing a copy of a slice of data from an underlying Tensor.

#include <algorithm>
#include <array>

class MyCustomOperator {
    /* Existing code... */
   public:
    void tile() override {
        auto inputs0 = getInput(kInput0);
        auto inputs1 = getInput(kInput1);
        auto outputs = getOutput(kOutput);
        // The simplest tiling strategy is to tile per batch. Each tile will
        // have a size of at most 1 x maxTileSize.
        int maxTileSize =
                std::min(ReferenceBackend::SpadSize() / inputs0->getDataTypeSize(),
                         inputs0->getShape().storageSize());
        TensorShape tileShape(
                { 1, maxTileSize }, DataLayout::NC, ReferenceBackend::Alignment);
        // The final bool parameter specifies whether to copy the data from the
        // source tensor into each of its tiles. Obviously, we want to do this
        // for the input tensors, but the output tensor is empty, so there's no
        // need to waste time on that.
        tiledTensors[kInput0] = generateTiledTensorPerBatchNC(inputs0, tileShape, this, true);
        tiledTensors[kInput1] = generateTiledTensorPerBatchNC(inputs1, tileShape, this, true);
        // The output tiles are stored after the input tiles.
        tiledTensors[kNumInputs + kOutput] =
                generateTiledTensorPerBatchNC(outputs, tileShape, this, false);
    }

   private:
    // Because tensor tiling is done at the start of the program (before the
    // operator starts running), these tiles need to be stored in memory for
    // use later.
    std::array<TiledTensor, 3> tiledTensors;
};

Modify run() to loop over all our tiles

This part is straightforward. Iterate over each of the tensor tiles and run the elementwise_add function on their contents. Finally, flatten the tiled output tensor back into a single one. Since this is an elementwise operation, there is no need to consider data reuse, so a simple for loop will do. For more complex operators in which data reuse is a critical factor to optimize for, changing the order of iteration along the dimensions may greatly affect performance.

void run() override {
    TiledTensor& input0 = tiledTensors[kInput0];
    TiledTensor& input1 = tiledTensors[kInput1];
    TiledTensor& output = tiledTensors[kNumInputs + kOutput];

    for (int i = 0; i < input0.size(); i++) {
        Tensor* input0Tile = input0.getTileWithData(i);
        Tensor* input1Tile = input1.getTileWithData(i);
        Tensor* outputTile = output.getTileWithData(i);
        // Get handles to the actual underlying data storage. This performs a
        // dynamic_cast to the specified data type, which we verified is safe
        // inside validate().
        float* input0Data = input0Tile->data<float>();
        float* input1Data = input1Tile->data<float>();
        float* outputData = outputTile->data<float>();
        elementwise_add(input0Data, input1Data, outputData,
                        outputTile->getShape().size());
    }
    // The results of the elementwise_add are stored in the tiled tensor. We
    // need to merge the data from the individual tiles back into a single
    // contiguous Tensor.
    flattenTiledTensor(tiledTensors[kNumInputs + kOutput], getOutput(kOutput));
}

As usual, let's add a unit test to verify that this implementation is correct. We need to make our inputs larger than the scratchpads we created so that tiling actually takes effect. Add this code to your existing unit test file.

// A function to fill the tensor with a sequence of monotonically increasing
// data, starting from 0. Note that this is ONLY advised for elementwise/unary
// operators in which we don't care about data in specific dimensions.
void fillTensorWithSequentialFloat32Data(Tensor* tensor) {
    float* data = tensor->data<float>();
    for (int i = 0; i < tensor->getShape().storageSize(); i++) {
        data[i] = i;
    }
}

TEST_CASE_METHOD(SmaugTest, "MyCustomOperatorWithTiling", "[tiling]") {
    // With float32 elements, each of these tensors occupies 128KB, which is
    // much larger than our 32KB scratchpads, so each tensor will be broken
    // into multiple tiles.
    TensorShape shape({ 8, 4096 }, DataLayout::NC);
    Tensor* input0 = new Tensor("tensor0", shape);
    Tensor* input1 = new Tensor("tensor1", shape);
    workspace()->addTensor(input0);
    workspace()->addTensor(input1);

    // Create the operator and fill it with our tensors.
    using TestOp = MyCustomOperator<ReferenceBackend>;
    auto op = new TestOp("eltwise_add", workspace());
    op->setInput(input0, TestOp::kInput0);
    op->setInput(input1, TestOp::kInput1);
    // This will handle creating/allocating storage/filling data into all the
    // input tensors.
    createAndFillTensorsWithData(op, &fillTensorWithSequentialFloat32Data);

    // Compute the expected output.
    std::vector<float> expected_output(8 * 4096, 0);
    for (int i = 0; i < expected_output.size(); i++) {
        expected_output[i] = 2 * i;
    }

    op->tile();
    op->run();

    Tensor* output = op->getOutput(TestOp::kOutput);
    verifyOutputs(output, expected_output);
}

Use gem5-Aladdin APIs to copy data and invoke the kernel

From the perspective of the elementwise_add function, we've been writing our code as if it could directly access the contents of the input0, input1, and output arrays. These are pointers into host memory. But if the function represents a block of hardware, and that hardware can only access its local scratchpads, then we need to copy the contents of host memory into the scratchpads first. In gem5-Aladdin, this is accomplished with special functions: dmaLoad and dmaStore. They work just like regular calls to memcpy, except LLVM-Tracer will recognize them as special function calls and have Aladdin handle them appropriately. This is handled as follows:

// By convention, we prefix all pointers into host memory with "host_".
void elementwise_add(float* host_input0,
float* host_input1,
float* host_output,
float* spad0,
float* spad1,
int size) {
// Copy input data from host_inputN to spadN. The first argument to dmaLoad
// or dmaStore is always the destination.
dmaLoad(spad0, host_input0, size);
dmaLoad(spad1, host_input1, size);
for (int i = 0; i < size; i++) {
// Accumulate the data from spad0 into spad1.
// NOTE: This could be optimized more if we had three scratchpads instead
// of two. This would be a great exercise for the reader :)
spad1[i] += spad0[i];
}
// Copy output data from spad1 back to the host.
dmaStore(host_output, spad1, size);
}

Next, for simulation, we need to set up a TLB page mapping for all host memory accessed by the accelerator, so that the dmaLoad/dmaStore functions can load the correct data into the scratchpads. This is done with the mapArrayToAccelerator API.

Finally, we need to update the call to elementwise_add and wrap it with the invokeKernel function, so that in simulation, gem5-Aladdin will know to fire up the hardware accelerator model instead of just running the kernel function on the CPU.

void run() override {
    TiledTensor& input0 = tiledTensors[kInput0];
    TiledTensor& input1 = tiledTensors[kInput1];
    TiledTensor& output = tiledTensors[kNumInputs + kOutput];

    for (int i = 0; i < input0.size(); i++) {
        Tensor* input0Tile = input0.getTileWithData(i);
        Tensor* input1Tile = input1.getTileWithData(i);
        Tensor* outputTile = output.getTileWithData(i);
        // Get handles to the actual underlying data storage. This performs a
        // dynamic_cast to the specified data type, which we verified is safe
        // inside validate().
        float* input0Data = input0Tile->data<float>();
        float* input1Data = input1Tile->data<float>();
        float* outputData = outputTile->data<float>();
        int size = outputTile->getShape().size();

        // Set up the TLB mappings.
        mapArrayToAccelerator(
                ref::kMyCustomOperatorHw,  // The accelerator ID this TLB mapping is for.
                "host_input0",  // The name of the function argument in the kernel function.
                input0Data,     // The pointer to the data.
                size * sizeof(float));  // The size of the TLB mapping.
        mapArrayToAccelerator(
                ref::kMyCustomOperatorHw, "host_input1",
                input1Data, size * sizeof(float));
        mapArrayToAccelerator(
                ref::kMyCustomOperatorHw, "host_output",
                outputData, size * sizeof(float));

        // Wrap the call to elementwise_add with invokeKernel.
        invokeKernel(ref::kMyCustomOperatorHw,  // our accelerator ID
                     elementwise_add,  // if not simulating, the function to call
                     // All of the function call arguments.
                     input0Data,
                     input1Data,
                     outputData,
                     ref::spad0,
                     ref::spad1,
                     size);
    }
    // The results of the elementwise_add are stored in the tiled tensor. We
    // need to merge the data from the individual tiles back into a single
    // contiguous Tensor.
    flattenTiledTensor(tiledTensors[kNumInputs + kOutput], getOutput(kOutput));
}

Make the kernel function traceable with LLVM-Tracer

LLVM-Tracer is the instrumentation tool that we will use to generate a dynamic execution trace for Aladdin to consume. It is only compatible with instrumenting C code, not C++, but it can ignore any code with C++ linkage that it sees. This allows us to write in C++ for the vast majority of our code, dropping down into C only for the few kernel functions representing the hardware accelerators.

Adapting our code for use with LLVM-Tracer is simple: just surround the kernel function in an extern "C" block.

#ifdef __cplusplus
extern "C" {
#endif
void elementwise_add(/*...*/) { /* code ... */ }
#ifdef __cplusplus
}
#endif

Build your code, make sure it all compiles, and voila! We are ready to generate our trace, set up the simulator configuration files, and start simulating our first operator!
