SMAUG
Simulating Machine Learning Applications Using gem5-Aladdin
Due to certain limitations of the syscall-emulation mode of gem5-Aladdin, there are two key differences in how SMAUG runs under simulation compared to on real hardware, which affect how you will write code for SMAUG.
DL models are highly compute-intensive, but oftentimes these core kernels are very repetitive, so we can afford to sample the loop iterations to save on simulation time and trace storage. This means that during tracing and simulation, we will only execute a small fraction of the total loop iterations, and afterwards, we unsample this simulated time to estimate the actual time spent in the kernel.
Caveats of sampling if enabled:
1. Because only a small fraction of the loop iterations are actually executed, the numerical outputs of a sampled kernel will not be correct, so do not use sampled runs to verify results.
2. The reported kernel time is an unsampled estimate (the sampled time scaled back up by the sampling factor), not an exact measurement.
Here's an example of how a loop is instrumented to enable sampling:
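The snippet below is only a sketch of the pattern, not code from SMAUG: the reduction kernel, its loop label, and the standalone prototype of Aladdin's setSamplingFactor() are illustrative assumptions. The idea is that the loop runs only sample_num iterations (the value of --sample-num, described below) and registers total/sampled as the factor by which the simulator should scale the measured loop time back up.

```cpp
// Assumed prototype of Aladdin's sampling hook (normally provided by a
// gem5-Aladdin header); declared here to keep the sketch self-contained.
extern "C" void setSamplingFactor(const char* loop_name, float factor);

// Hypothetical kernel: sum `size` elements, but only run `sample_num`
// iterations when sampling is enabled (sample_num > 0).
void reduction_kernel(const float* input, float* output, int size,
                      int sample_num) {
    int iters_to_run = size;
    float sampling_factor = 1.0f;
    if (sample_num > 0 && sample_num < size) {
        iters_to_run = sample_num;
        // Unsampling factor = total iterations / sampled iterations.
        sampling_factor = static_cast<float>(size) / sample_num;
    }
    // Tell the simulator how much to scale this loop's simulated time.
    setSamplingFactor("reduction_loop", sampling_factor);
    float sum = 0;
reduction_loop:
    for (int i = 0; i < iters_to_run; i++) {
        sum += input[i];
    }
    output[0] = sum;
}
```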
By default, sampling is disabled. It is controlled by two flags in SMAUG:
1. --sample-level=[no|low|med|high|very_high]: This is a hint to the software that indicates how many levels of a nested loop are sampled. For example, "low" could mean that only the innermost loop is sampled, but everything above it is not affected, whereas "very_high" could mean to sample every level of a nested loop. Different kernels can implement this differently; this is merely a hint.
2. --sample-num=N: This informs the loop how many iterations to run (if sampling is enabled). We often set this to 1 or 2. 1 is sufficient for many loops, but for pipelined loops, 2 iterations are required to correctly measure the amount of overlap between successive iterations.
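As a concrete illustration of these flags, a hypothetical invocation that samples only the innermost loops and runs two iterations of each sampled loop might look like the following (the binary name and model file names are placeholders; only the two sampling flags come from this page):

```bash
./smaug model_topo.pbtxt model_params.pb --sample-level=low --sample-num=2
```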
In gem5's syscall-emulation mode, multithreading has slightly quirky behavior.

1. You must specify the number of CPUs to simulate by passing --num-cpus=N to gem5. gem5 will create N ThreadContext objects to represent the state of each thread. You must ensure that you never try to spawn more than this number of threads, or gem5 will crash with an "out of ThreadContexts" error.
2. When a thread exits, its ThreadContext will be destroyed, but that slot cannot be reused. A newly created thread will always attempt to allocate a new ThreadContext, and if you've already created N ThreadContexts, any attempts to create more will fail even if CPUs are sitting around idling.

To work around these limitations, we have developed a specialized ThreadPool implementation. It ensures that as long as the pool is still alive, none of its threads will exit. A work queue API is exposed for the application to use. Thread scheduling is round-robin, and CPUs are deterministically assigned. To reduce excessive memory traffic from spin-idling threads, we use a magic instruction to quiesce a CPU when there is no work for it to do, and a magic wake instruction when a quiesced CPU has work again. When running on real hardware, these instructions are ignored.
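The sketch below shows the intended usage pattern of that work queue: split a task into one chunk per worker thread, dispatch each chunk, then wait for everything to finish. The global threadPool pointer, the header paths, and the member names (size, dispatchThread, joinThreadPool) are assumptions about SMAUG's API rather than a verified listing; check the ThreadPool class and the example referenced below for the real interface.

```cpp
#include <algorithm>
#include <vector>

// Assumed header locations for the global thread pool and its class.
#include "smaug/core/globals.h"
#include "smaug/core/thread_pool.h"

namespace {

// Arguments for one chunk of work.
struct ScaleArgs {
    float* data;
    int start;
    int count;
    float factor;
};

// Work item run on one of the pool's worker threads; the pthread-style
// void* (*)(void*) signature is an assumption about the work queue API.
void* scaleChunk(void* rawArgs) {
    auto* args = reinterpret_cast<ScaleArgs*>(rawArgs);
    for (int i = args->start; i < args->start + args->count; i++)
        args->data[i] *= args->factor;
    return nullptr;
}

}  // namespace

// Scale an array in parallel: one chunk per worker thread, each dispatched to
// the pool's work queue, then block until all of them have run.
void parallelScale(float* data, int size, float factor) {
    int numThreads = smaug::threadPool->size();
    int chunk = (size + numThreads - 1) / numThreads;
    std::vector<ScaleArgs> args(numThreads);
    for (int t = 0; t < numThreads; t++) {
        int start = std::min(t * chunk, size);
        args[t] = { data, start, std::min(chunk, size - start), factor };
        smaug::threadPool->dispatchThread(scaleChunk, &args[t]);
    }
    // While waiting, idle CPUs are quiesced by the pool, so they do not
    // generate spin-loop memory traffic.
    smaug::threadPool->joinThreadPool();
}
```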
To use the thread pool with N threads, pass --num-cpus=N+1 to gem5: we need one extra CPU to run the main thread!

You can see examples of this in action at TiledTensor::parallelCopyTileData.
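For instance, an 8-thread pool needs 9 simulated CPUs. The command below is only an illustration: the gem5 binary path, the aladdin_se.py config script, and SMAUG's own arguments are assumptions, and --num-cpus=N+1 is the only part prescribed by this page:

```bash
# 8 worker threads + 1 CPU for the main thread => --num-cpus=9
build/X86/gem5.opt configs/aladdin/aladdin_se.py \
    --num-cpus=9 \
    --cmd=./smaug \
    --options="model_topo.pbtxt model_params.pb"
```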