Quadric’s general-purpose neural processing unit (GPNPU) future-proofs ML designs.

Nearly every new system-on-chip (SoC) design start incorporates machine learning (ML) inference capabilities. The applications are widespread—smartphones, tablets, security cameras, automotive, wireless systems, and more. Silicon design teams are scrambling to add ML processing capabilities to the central processing units (CPUs), digital signal processors (DSPs), and graphics processing units (GPUs) that they’re already deploying. However, they’ve hit a wall with current processors and are now looking at neural processing units (NPUs).

Why is this so challenging? ML workloads are very different from the workloads that run on existing processors. CPUs are good at running many simultaneous threads of branching control code with irregular memory accesses. DSPs are designed to perform vector mathematics on 1D and 2D arrays of data. GPUs are designed to draw polygons in graphics applications. ML inference workloads do not fit neatly into any of these older architectures because they are dominated by matrix computations (convolutions) on N-dimensional tensor data.
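To see why, consider that the core of a convolution layer reduces to deeply nested multiply-accumulate loops over tensor data. The naive C++ sketch below is illustrative only (single input channel, no padding or stride), but it shows where virtually all of the arithmetic lives:

#include <vector>

// Naive 2D convolution: one input channel, outC output channels, no padding.
// Every innermost iteration is one multiply-accumulate (MAC), which is why
// ML inference performance is dominated by MAC throughput.
void conv2d(const std::vector<float>& in, int inH, int inW,
            const std::vector<float>& weights,            // laid out [outC][kH][kW]
            std::vector<float>& out, int outC, int kH, int kW)
{
    const int outH = inH - kH + 1;
    const int outW = inW - kW + 1;
    out.assign(static_cast<size_t>(outC) * outH * outW, 0.0f);

    for (int oc = 0; oc < outC; ++oc)
        for (int oy = 0; oy < outH; ++oy)
            for (int ox = 0; ox < outW; ++ox) {
                float acc = 0.0f;
                for (int ky = 0; ky < kH; ++ky)
                    for (int kx = 0; kx < kW; ++kx)
                        acc += in[(oy + ky) * inW + (ox + kx)] *
                               weights[(oc * kH + ky) * kW + kx];   // the MAC
                out[(oc * outH + oy) * outW + ox] = acc;
            }
}

Scaled across dozens of channels and layers, this pattern produces the billions of MACs per inference that these new engines are being asked to absorb.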

What is an NPU?

Typically, design teams and IP vendors have tried to solve this problem by force-fitting the new workloads onto the old platforms. By analyzing existing ML benchmarks to identify the most frequently occurring compute operators, they have built offload engines (accelerators) that efficiently execute those select building blocks. These offload engines are often called NPUs but, unlike CPUs, they are not software programmable. Instead, they are essentially large arrays of hard-wired, fixed-point multiply-accumulate (MAC) blocks running in parallel.

The idea behind these accelerators is that, if the 10 or 20 most common ML graph operators represent 95% to 98% of the computational workload, then offloading those common operators lets the fully programmable CPU or DSP execute the rest of the graph, including any rare or unusual operators. This division of labor is often called “Operator Fallback” because the vast majority of computation runs on the non-programmable NPU, while the program “falls back” to the fully programmable CPU or DSP as required.
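In code, such a partition amounts to a dispatch loop over the graph. The sketch below is purely illustrative; the operator list, runtime structure, and function names are hypothetical, not any vendor’s API:

#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical graph runtime: offload the handful of operator types the
// fixed-function NPU supports and "fall back" to the CPU for everything else.
struct Op { std::string type; /* tensors, attributes, ... */ };

static const std::unordered_set<std::string> kNpuSupportedOps = {
    "Conv2D", "DepthwiseConv2D", "MatMul", "Relu", "MaxPool", "Add"
};

void runOnNpu(const Op& op) { /* dispatch to the hard-wired MAC array (stub) */ }
void runOnCpu(const Op& op) { /* slow scalar reference implementation (stub) */ }

void runGraph(const std::vector<Op>& graph)
{
    for (const Op& op : graph) {
        if (kNpuSupportedOps.count(op.type) != 0)
            runOnNpu(op);   // the common 95% to 98% of the compute
        else
            runOnCpu(op);   // Operator Fallback for rare or new operators
    }
}

The scheme looks clean on paper; the trouble, as the next section shows, is what happens inside that CPU fallback path.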

The Biggest Pitfall

The biggest pitfall—actually, the fatal flaw—of this approach is the assumption that Operator Fallback is rare and not all that important. That assumption discounts the upfront engineering effort required to manually partition these tasks, and it discounts the performance hit involved. A closer look makes it obvious that Operator Fallback actually needs to be avoided at all costs.

Consider the example of an SoC with a large, general-purpose CPU, a vector DSP engine for vision processing, and a 4 TOPS ML accelerator (where TOPS means tera, or trillion, operations per second). The compute resources available in each engine are shown in the table.

A matrix operation running on the accelerator is screaming fast, taking advantage of all 2048 multiply-accumulate (MAC) units in the NPU. The same operators running on the DSP, however, are 32X slower. On the CPU, with only 16 MACs, they are 128X slower (2048 ÷ 16 = 128, assuming throughput scales with the number of MAC units).

A Huge Performance Bottleneck

Even if only 5% of the total computation of an ML workload needs to fall back onto the CPU, that small 5% becomes a huge performance bottleneck for the entire inference execution. If 98% of the computation runs blazingly fast on the accelerator but the complex SoftMax final layer of the graph executes 100X or 1000X slower on the CPU, the entire inference time is dominated by the slow CPU performance.
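A rough worked example makes the point. Normalize a hypothetical workload to 100 units of compute and assume the fallback operators see the full 128X penalty from the example above:

   95 units on the accelerator:  95 × 1   =  95 time units
    5 units on the CPU:           5 × 128 = 640 time units
   Total:                                   735 time units

The 5% that falls back now accounts for roughly 87% of the runtime, and the whole inference runs more than 7X slower than it would if all 100 units stayed on the accelerator.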

Fallback only gets worse over time because machine learning is rapidly evolving. By the time silicon being designed today enters volume production in 2025 or 2026, today’s reference models will have been replaced by newer, more complex, and more accurate ML models. Those new models, three years from now, will likely feature new operator variants or new network topologies, necessitating even more fallback onto the slow CPU or DSP. Total performance will degrade further, leaving the chip underperforming or perhaps even unsuitable for the task.

But that’s better than going without an NPU, right? Well, yes, but there is a different approach worth considering, one that eliminates the aforementioned pitfalls by making the NPU just as programmable as the CPU or DSP. Crucially, it must be programmable in C++, so engineers can easily add operators as ML tasks evolve.

Making an NPU Programmable

Designing a programmable NPU is not for the faint of heart. It is far more complex than designing a hardware accelerator for the most common operators, because it must be able to execute diverse workloads with great flexibility—all on a single machine. Ideally, it makes a separate CPU and/or DSP unnecessary, since it can execute those code streams efficiently itself.

Quadric’s Chimera general-purpose neural processing unit (GPNPU) enables hardware developers to instantiate a single core that can handle an entire ML workload along with the DSP pre-processing, post-processing, and signal-conditioning workloads that are typically intermixed with ML inference functions. Dealing with a single core dramatically simplifies hardware integration and eases performance optimization. Furthermore, system design tasks such as profiling memory usage to ensure sufficient off-chip bandwidth are greatly simplified.

Simplified Software Development

The Chimera GPNPU also significantly simplifies software development, since matrix, vector, and control code can all be handled in a single code stream. ML graph code from common training toolsets (TensorFlow, PyTorch, ONNX) is compiled by the Quadric toolset, merged with signal-processing code written in C++, and built into a single code stream running on a single processor core. The Chimera SDK enables the mixing and matching of any data-parallel algorithm, irrespective of whether it is expressed as a machine learning graph or as traditional C++ code.
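As a purely illustrative sketch of what a single code stream means for the developer (the function names below are invented for illustration and are not the actual Chimera SDK interfaces), pre-processing, inference, and post-processing can all live in one ordinary C++ translation unit:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch only: these names are invented for illustration and are
// not the real Chimera SDK interfaces.

// C++ signal conditioning (pre-processing).
std::vector<float> normalizePixels(const std::vector<std::uint8_t>& raw) {
    std::vector<float> out(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) out[i] = raw[i] / 255.0f;
    return out;
}

// Stand-in for the graph compiled from a TensorFlow/PyTorch/ONNX model.
std::vector<float> runCompiledGraph(const std::vector<float>& input) {
    return input;  // placeholder for the compiled ML graph
}

// C++ post-processing.
int argmaxLabel(const std::vector<float>& logits) {
    return static_cast<int>(std::max_element(logits.begin(), logits.end()) - logits.begin());
}

// All three stages compile into one code stream targeting one core.
int classifyFrame(const std::vector<std::uint8_t>& cameraFrame) {
    return argmaxLabel(runCompiledGraph(normalizePixels(cameraFrame)));
}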

The Chimera GPNPU, available in 1 TOPS, 4 TOPS, and 16 TOPS variants, is fully C++ programmable by the software developer. New ML operators can be quickly written and run just as fast as the “native” operators written by Quadric engineers. The result is no fallback, only fast execution! This future-proofs ML designs, no matter what new forms of operators or graphs the future brings.
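For example, if a future model introduced an operator the built-in library did not cover, the operator itself is just ordinary C++ over tensor data. The numerically stable SoftMax below is a generic sketch of such an operator, not Quadric’s kernel code:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Generic sketch of a custom operator written in plain C++: a numerically
// stable SoftMax over one vector of logits. On a C++-programmable NPU this
// kind of routine can be added without falling back to a host CPU.
std::vector<float> softmax(const std::vector<float>& logits) {
    if (logits.empty()) return {};

    const float maxLogit = *std::max_element(logits.begin(), logits.end());

    std::vector<float> out(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp(logits[i] - maxLogit);  // subtract max for stability
        sum += out[i];
    }
    for (float& v : out) v /= sum;
    return out;
}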

www.quadric.io