HeMA-MISO: Heterogeneous Memory Architecture for LLM Inference with SW Optimization


Note: This research was conducted in the first half of 2025. Some information may be outdated at the time of reading.

Hello, I’m a beginner at computer architecture, and this is a toy project about an NPU with a heterogeneous memory system. To give you a glimpse up front: it’s not a groundbreaking idea, but I wanted to share it anyway, which is why I wrote this post and signed up for the dev.to community. Have fun with my idea! 🙂

Overview

The inference process for Large Language Models (LLMs) places a significant burden on memory systems in several ways.
First, the low arithmetic intensity of the computations means that operations are often memory-bound, highlighting the critical need for high memory bandwidth. Second, the nature of deep learning involves very large strides in data reuse, making it difficult to effectively utilize SRAM-based caches. Finally, server-scale LLMs that employ long contexts or Mixture of Experts (MoE) architectures often run into sheer memory capacity limitations.
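To put a rough number on the first point, here is a back-of-the-envelope sketch of the arithmetic intensity of a decode-step matrix-vector product. The values are illustrative assumptions I picked, not measurements from any specific chip:

```python
# Rough arithmetic-intensity estimate for a decode-step GEMV (batch size 1),
# illustrating why LLM generation tends to be memory-bound.
# All numbers are illustrative assumptions.

hidden = 4096                    # assumed hidden dimension
bytes_per_param = 1              # FP8 weights

flops = 2 * hidden * hidden                      # one multiply-add per weight
bytes_moved = hidden * hidden * bytes_per_param  # each weight is read once

intensity = flops / bytes_moved                  # ~2 FLOP/byte
print(f"Arithmetic intensity: {intensity:.1f} FLOP/byte")

# At ~2 FLOP/byte, even ~1.2 TB/s of HBM sustains only ~2.4 TFLOP/s,
# far below the peak throughput of a modern NPU's compute units.
```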
This research focuses on the memory capacity constraint. I propose HeMA (Heterogeneous Memory Architecture), a Neural Processing Unit (NPU) structure that leverages heterogeneous memory. To complement this architecture, I also introduce MISO (Memory-Informed Software Optimization), a software optimization technique designed to maximize its efficiency.

Motivation

The challenge of memory capacity becomes especially clear with recent Mixture of Experts (MoE) models. The table below details several prominent MoE models, showing the number of experts, the proportion of total parameters that the experts occupy, and the number of TPUv4 chips required to store the full model in FP8 precision. (TPUv4 is used as a reference because it is a well-documented and respected NPU architecture).

As the table shows, experts account for the vast majority of the model’s parameters. The final column illustrates that if these MoE layers were replaced with standard Feed-Forward Network (FFN) layers (i.e., considering only the activated parameters), the hardware requirements would drop dramatically.
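As a rough illustration of how such a chip count comes about, here is a small sketch. The parameter counts are approximate public figures for a few prominent MoE models, and 32 GB is TPUv4's documented HBM capacity; KV cache and activations are ignored, so treat the whole thing as a simplification.

```python
import math

# Rough estimate of how many TPUv4 chips are needed just to hold the model
# weights in FP8 (1 byte per parameter). Parameter counts are approximate
# public figures; KV cache and activations are ignored.

HBM_PER_CHIP_BYTES = 32e9  # TPUv4 HBM capacity

models = {
    # name: (total parameters, activated parameters per token)
    "Mixtral-8x7B":  (46.7e9, 12.9e9),
    "Llama-4 Scout": (109e9,  17e9),
    "DeepSeek-V3":   (671e9,  37e9),
}

for name, (total, active) in models.items():
    full_chips  = math.ceil(total  / HBM_PER_CHIP_BYTES)
    dense_chips = math.ceil(active / HBM_PER_CHIP_BYTES)
    print(f"{name:14s}: {full_chips:2d} chip(s) for all experts, "
          f"{dense_chips} chip(s) for activated parameters only")
```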

The problem is intensified by the trend toward long context lengths. The table below summarizes the maximum context lengths of recent LLM models.

Modern LLMs are continuously expanding their context windows to handle tasks like multi-turn conversations, video processing, and comprehension of entire technical documents, with models like Llama-4 Scout reaching up to 10 million tokens. This increases the execution time of the memory-bound attention mechanism and adds to the overall memory capacity burden.
One solution to this capacity issue is to scale out by adding more TPUs. However, this approach wastes significant computational power and introduces interconnect bottlenecks, which can become more limiting than memory bandwidth itself, while also incurring high synchronization costs.
Alternatively, one could add high-capacity memory like DDR or LPDDR directly to the NPU. The drawback here is that their lower bandwidth would severely throttle inference throughput.
A more promising approach is to construct a hierarchical memory system that offers both high capacity and high bandwidth by combining HBM (High Bandwidth Memory) and DDR/LPDDR. An existing example of this is the SambaNova SN40L architecture, which uses a two-tiered system of HBM and DDR.

Methodology

HeMA
To address these memory challenges, I propose HeMA (Heterogeneous Memory Architecture), which features a two-tiered memory structure.

The HeMA approach involves adding DDR or LPDDR memory to an NPU that already utilizes HBM. To manage these different memory types, a dedicated memory handler core is also introduced.
The resulting HeMA architecture is illustrated below. The diagram on the left is adapted from the SambaNova SN40L paper to show its chip structure.

When a compute unit sends a memory request to the memory handler core (MHC), the MHC uses a comparator to direct the request based on its memory address. Requests destined for the high-bandwidth, upper-tier memory are placed in the “first memory request/data buffer,” while requests for the high-capacity, lower-tier memory go into the “second memory request/data buffer.”
Following this, a router sends the request to the specific memory controller that manages the target channel address. A key feature of the HeMA architecture is that data can be transferred between both memory tiers in cache-line-sized units.
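Below is a minimal sketch of the MHC's routing decision as I imagine it. The address boundary, the buffer objects, and the cache-line size are placeholder assumptions, not a real RTL design or simulator interface:

```python
from collections import deque

# Minimal sketch of the memory handler core's comparator logic. The upper
# tier (HBM) is assumed to be mapped to the low address range and the lower
# tier (DDR/LPDDR) above it; both the boundary and the buffers are placeholders.

HBM_CAPACITY = 32 * 2**30   # assumed upper-tier capacity (32 GiB)
CACHE_LINE   = 64           # tier-to-tier transfers happen at this granularity

first_buffer  = deque()     # "first memory request/data buffer"  -> HBM tier
second_buffer = deque()     # "second memory request/data buffer" -> DDR tier

def route_request(addr: int, payload: bytes | None = None) -> None:
    """Comparator: steer a request into the buffer of the owning memory tier."""
    target = first_buffer if addr < HBM_CAPACITY else second_buffer
    target.append((addr, payload))

def migrate_line(src_addr: int, dst_addr: int) -> None:
    """Move one cache line between tiers: a read request plus a write request."""
    route_request(src_addr)                               # read from source tier
    route_request(dst_addr, payload=b"\x00" * CACHE_LINE) # write to destination tier
```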

Additionally, as is common with most NPUs, the compute unit can only directly access data that is in its Scratch Pad Memory (SPM). (NPUs that can access DRAM directly are less common, with Tenstorrent being a notable exception).

MISO: Memory-Informed Software Optimization

So, how can we best leverage the HeMA architecture? This question leads us to the MISO technique. I started from two key observations:

  • LLM inference operations have low data reusability and long reuse intervals, which results in poor utilization of the NPU’s cache, the Scratch Pad Memory (SPM).
  • When the system is accessing data from the slower DDR/LPDDR memory, the high-bandwidth pathways of the HBM and SPM are left idle.

MISO is designed to exploit this inefficiency.

To understand the problem, consider a standard LLM inference operation on data stored in DDR. First, a chunk of data is moved to the SPM, and the compute units begin processing it. With double buffering, the system would simultaneously start fetching the next chunk of data needed for the subsequent operation.

However, due to the low bandwidth of DDR and the high speed of LLM computations, the work on the current data in the SPM finishes long before the next chunk of data arrives. The compute units process the available data faster than the DDR can supply it, leading to stalls and wasted cycles.

Critically, while the system is stalled waiting for DDR, the high-bandwidth HBM is sitting completely idle. MISO is designed to capitalize on this opportunity.
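To see how large the gap is, here's a back-of-the-envelope comparison with TPUv4-class numbers. All values are assumptions chosen for illustration:

```python
# Compute time vs. fetch time for one double-buffered tile coming from DDR.
# All numbers are illustrative assumptions (TPUv4-class compute and HBM).

TILE_BYTES    = 8 * 2**20   # an 8 MiB chunk of weights / KV cache
DDR_BW        = 256e9       # 256 GB/s
HBM_BW        = 1.2e12      # ~1.2 TB/s
PEAK_FLOPS    = 275e12      # peak compute
FLOP_PER_BYTE = 2           # memory-bound, GEMV-like kernel

compute_us = TILE_BYTES * FLOP_PER_BYTE / PEAK_FLOPS * 1e6
ddr_us     = TILE_BYTES / DDR_BW * 1e6
hbm_us     = TILE_BYTES / HBM_BW * 1e6

print(f"compute on the tile : {compute_us:6.2f} us")
print(f"fetch next from DDR : {ddr_us:6.2f} us  <- compute units stall here")
print(f"fetch next from HBM : {hbm_us:6.2f} us  <- meanwhile HBM sits idle")
```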
As shown in the diagrams, MISO proposes a novel way to utilize the SPM. The core idea is to partition the SPM:

  • A small, limited portion of the SPM is reserved for the ongoing computation using data from the DDR.
  • The remaining, larger portion of the SPM is used to prefetch the data needed for the next major operation directly from the high-bandwidth HBM. This is the HBM-prefetch.

This prefetching happens in parallel, hiding the data transfer latency. Once the computation on the last chunk of DDR data is complete, the SPM partition reserved for DDR is released, and the entire SPM becomes available for the now-prefetched data from HBM.
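In terms of budget, the split I have in mind looks roughly like this. The 128 MB total SPM is a TPUv4-class assumption, and the 8 MB DDR partition comes from the tiling experiments later in this post:

```python
# SPM budget split that MISO relies on. Capacities and bandwidth are
# assumptions; the point is how much can be staged from HBM "for free"
# while the DDR-side computation is still running.

SPM_TOTAL_MB = 128   # assumed on-chip scratch pad capacity
DDR_PART_MB  = 8     # reserved for the ongoing DDR-based computation
PREFETCH_MB  = SPM_TOTAL_MB - DDR_PART_MB   # filled from HBM in parallel

HBM_BW_GBPS = 1200
prefetch_ms = PREFETCH_MB / 1000 / HBM_BW_GBPS * 1e3

print(f"{PREFETCH_MB} MB of HBM data can be staged in ~{prefetch_ms:.2f} ms, "
      f"hidden behind the much slower DDR-bound computation")
```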

However, for this prefetching method to be effective, three key conditions must be met:

  1. Minimal Performance Impact: Restricting the SPM area available for DDR-based operations must cause little to no performance degradation for those operations themselves.
  2. Strategic Data Placement: The data must be laid out in memory so that the model alternates between computations using data from DDR and computations using data from HBM. This ensures both memory systems are utilized in sequence, creating opportunities for overlapping operations.
  3. Maximized Prefetch Capacity: The SPM partition reserved for DDR data must be made as small as possible to maximize the space available for the HBM-prefetch. To achieve this, techniques like operator fusion and tiling must be aggressively applied to the computations running on DDR data.

Exploring MISO: Experimental Setup

To validate the MISO technique, I defined a hypothetical HeMA-TPU architecture with the following specifications.

I started with a baseline TPUv4 and added 128 GB of DDR memory, which provides a bandwidth of 256 GB/s. (I chose DDR over LPDDR because, while NPU architectures combining HBM and DDR exist, I have not yet seen commercial examples of an HBM-LPDDR combination.) All experiments on the HeMA-TPU were run using the NPU simulator, ONNXim.
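For reference, the configuration I have in mind looks roughly like the following. The HBM, SPM, and compute figures are standard published TPUv4 numbers; the DDR tier is the part I added, and the field names are just illustrative, not ONNXim's actual config schema:

```python
# Hypothetical HeMA-TPU used in the experiments. HBM/compute/SPM values are
# published TPUv4 figures; the DDR tier is the addition described above.
# Field names are illustrative, not ONNXim's real configuration format.

hema_tpu = {
    "compute": {"peak_bf16_tflops": 275, "spm_capacity_mb": 128},
    "hbm":     {"capacity_gb": 32,  "bandwidth_gbps": 1200},  # upper tier
    "ddr":     {"capacity_gb": 128, "bandwidth_gbps": 256},   # added lower tier
    "mhc":     {"tier_transfer_granularity_bytes": 64},       # cache-line moves
}
```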

For the workload, I selected the Mixtral 8x7B model, assuming FP8 precision for all model parameters. I chose Mixtral because its specifications are similar to those of the Llama-2 7B model used in the experiments for the RNGD chip, a single-chip solution with comparable specs that was tested with a batch size of 8 and a sequence length of 4096. This let me test a similarly sized model that also incorporates MoE layers. Thanks to its high memory capacity, the HeMA-TPU can run the experiment on a single chip with a sequence length of 4096 and a batch size of 64.
The specifications for the Mixtral model used are as follows.
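In code form, the hyperparameters I assumed are below. These are the published Mixtral 8x7B figures as I understand them, so they are worth double-checking against the official model card:

```python
# Mixtral 8x7B hyperparameters assumed in the experiments (FP8 weights).
# Quoted from the public model description; double-check before reuse.

mixtral_8x7b = {
    "num_layers":        32,
    "hidden_size":       4096,
    "num_heads":         32,
    "num_kv_heads":      8,       # grouped-query attention
    "head_dim":          128,
    "ffn_hidden_size":   14336,   # per expert
    "num_experts":       8,
    "experts_per_token": 2,       # top-2 gating
    "weight_bytes":      1,       # FP8
}
```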

Exploring MISO: Data Placement

With the architecture defined, I then needed to decide which data to store in DDR and which in HBM.
To inform this decision, I first measured the performance impact of memory bandwidth on the key operations. As the graph below illustrates, using HBM for both attention and FFN operations resulted in nearly identical performance gains of about 4.6x compared to using DDR. This result suggested that there would likely be no significant performance difference whether I prioritized storing the KV cache or the expert weights in HBM.
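That ~4.6x figure is easy to sanity-check: for a memory-bound kernel, the speedup from moving its data to a faster tier is bounded by the bandwidth ratio of the two tiers (using the HeMA-TPU numbers assumed above):

```python
# Roofline-style sanity check on the ~4.6x gain: a memory-bound kernel can be
# sped up at most by the ratio of the two tiers' bandwidths.

HBM_BW_GBPS = 1200   # assumed upper-tier bandwidth
DDR_BW_GBPS = 256    # lower-tier bandwidth

print(f"Upper bound on speedup: {HBM_BW_GBPS / DDR_BW_GBPS:.1f}x")  # ~4.7x
# Both attention and the MoE FFN landing at ~4.6x, right at this bound, is
# consistent with both operations being bandwidth-limited.
```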

Proving MISO: Finding the Right Data Placement Strategy

Since placing either the KV cache or the experts in HBM seemed equally viable from a raw performance perspective, I realized the decision was more nuanced. It wasn’t about which data type to prioritize, but how to lay out the data to enable the most efficient processing flow.
This led me to explore three distinct strategies for placing data within a single layer, as shown in the diagram below:

(a) KV-major: In this straightforward approach, I’d prioritize filling the high-speed HBM with the KV cache for as many layers as possible.
(b) Expert-major: This is the opposite strategy, where I’d give the expert weights priority for the HBM space. With both of these methods, however, capacity limits mean that some layers will inevitably be pushed entirely to the slower DDR memory, holding both their KV cache and experts there.
(c) MISO: This is the strategy I developed. Instead of prioritizing one data type over the other, the MISO approach ensures that for every single layer, a portion of its KV cache and a portion of its experts reside in HBM.
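To make the three policies concrete, here is a simplified sketch of how each one could assign data to HBM, layer by layer. The sizes and the HBM budget are placeholders, and real placement would also have to account for the non-expert weights:

```python
# Per-layer HBM placement under the three policies. All sizes are in the same
# (arbitrary) unit; non-expert weights and fragmentation are ignored.

def place(policy: str, num_layers: int, hbm_budget: float,
          kv_size: float, expert_size: float):
    """Return, per layer, the fraction of (KV cache, experts) kept in HBM."""
    plan, remaining = [], hbm_budget
    for _ in range(num_layers):
        if policy == "miso":
            # every layer gets an equal slice of HBM, split between KV and experts
            share = hbm_budget / num_layers
            kv = min(1.0, 0.5 * share / kv_size)
            ex = min(1.0, 0.5 * share / expert_size)
        else:
            first, second = ((kv_size, expert_size) if policy == "kv_major"
                             else (expert_size, kv_size))
            f = min(1.0, remaining / first);  remaining -= f * first
            s = min(1.0, remaining / second); remaining -= s * second
            kv, ex = (f, s) if policy == "kv_major" else (s, f)
        plan.append((kv, ex))
    return plan

# Example: 4 layers, HBM budget of 16 units, KV cache 4 and experts 8 per layer.
for policy in ("kv_major", "expert_major", "miso"):
    print(policy, place(policy, 4, 16.0, 4.0, 8.0))
```

Under the first two policies, the later layers end up entirely in DDR once the budget runs out; under the MISO policy, every layer keeps a slice of both its KV cache and its experts in HBM.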

Here’s how these different placement strategies affect the actual data flow during inference:

  • With the KV-major approach, the attention operation can pull data from fast HBM, but the subsequent MoE operation is forced to use the slow DDR.
  • The Expert-major strategy results in the opposite problem: the attention operation is stuck with DDR, while the MoE operation gets to use HBM.
  • The MISO approach, however, allows both the attention and MoE operations within a single layer to utilize HBM and DDR simultaneously. This is the key to enabling the data prefetching I was aiming for.

Exploring MISO: The Power of Tiling

The next critical step was to prove that MISO’s core idea was feasible. Remember, the strategy relies on limiting the Scratch Pad Memory (SPM) available for DDR-based operations to free up space for prefetching from HBM. The big question was: can I do this without tanking performance?
I started by evaluating the attention operation. To make this work, aggressive tiling is a must, which means that using an optimized method like Flash Attention is essential.
The graph below shows the results of this test. I simulated a decoding scenario for a single layer with a 4K sequence length, where the KV cache was located in DDR. I then measured performance as I varied two things:
  • The attention block size (or “step” in the legend): the dimension of the tile, specifically the partial sequence length used in the QK computation.
  • The SPM capacity (the x-axis): the amount of SPM I allowed the operation to use.

The results were quite revealing. As you can see from the graph, performance generally improves and converges as the tile (“step”) size gets larger. More importantly, I found that I could restrict the SPM capacity all the way down to 8MB without any performance degradation, except for the very largest step size of 2048. (Performance for the 1024-step did begin to drop once I limited the SPM to 4MB.)
This was exactly the result I was hoping for. It confirmed that I could free up a significant portion of the SPM during attention operations with minimal impact.
Based on this, I formulated the following strategy for MISO:
  1. I start the attention calculation using a large step size of 2048 with no SPM limit, to maximize performance.
  2. Then, to hide the latency of prefetching the expert weights from HBM for the next operation, I switch tactics for the final phase of the attention calculation: I reduce the step size to 1024 and limit the SPM usage to just 8MB (see the sketch below).
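For concreteness, here is one way this two-phase attention schedule could be expressed. The helper and its tile tuples are my own sketch, not a real kernel or ONNXim API:

```python
# Two-phase attention schedule over the KV sequence: large unrestricted tiles
# for the bulk, then smaller tiles under an 8 MB SPM cap for the final phase
# so the rest of the SPM can receive the HBM prefetch.

SEQ_LEN, BIG_STEP, TAIL_STEP, SPM_CAP_MB = 4096, 2048, 1024, 8

def attention_schedule(seq_len: int):
    """Yield (start, step, spm_limit_mb) tiles; None means no SPM restriction."""
    pos = 0
    # Bulk phase: big tiles, as long as at least one tail tile still fits after.
    while pos + BIG_STEP + TAIL_STEP <= seq_len:
        yield pos, BIG_STEP, None
        pos += BIG_STEP
    # Final phase: shrink the step and cap the SPM to free space for prefetching.
    while pos < seq_len:
        step = min(TAIL_STEP, seq_len - pos)
        yield pos, step, SPM_CAP_MB
        pos += step

for tile in attention_schedule(SEQ_LEN):
    print(tile)
```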

What About FFN Operations?

Next, I needed to run a similar analysis for the FFN operations. For FFNs, there is an opportunity to apply operator fusion, as illustrated in the diagram below.

However, this fusion comes with a trade-off. Increasing the tile size (“step”) allows for the fusion of the first two operations (FC0 and FC1), but it makes it difficult to fuse the third operation (FC2).
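A rough footprint calculation is one way to see the tension. The token count, FP8 activations, and FP32 accumulators below are my own assumptions, and double buffering is ignored:

```python
# Approximate SPM footprint for one tile of the Mixtral expert FFN, with and
# without FC2 fused in. FP8 weights/activations, FP32 output accumulators,
# and 64 tokens routed to the expert are assumptions.

HIDDEN, TOKENS = 4096, 64

def footprint_mb(step: int, fuse_fc2: bool) -> float:
    w01 = 2 * HIDDEN * step            # FC0 + FC1 weight tiles
    act = TOKENS * (HIDDEN + step)     # input activations + gated intermediate tile
    total = w01 + act
    if fuse_fc2:
        total += HIDDEN * step         # FC2 weight tile
        total += TOKENS * HIDDEN * 4   # FP32 output accumulator kept resident
    return total / 2**20

for step in (128, 256, 512, 1024):
    print(f"step={step:4d}: FC0+FC1 fused {footprint_mb(step, False):5.1f} MiB, "
          f"all three fused {footprint_mb(step, True):5.1f} MiB")
```

Under these assumptions, a small step keeps even the fully fused variant within a few megabytes, while a large step inflates the footprint, which is consistent with FC2 becoming hard to fuse at large tile sizes.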

This trade-off is reflected in the performance results shown in the graph below. For this chart, I plotted the performance for various step sizes against different SPM capacity limits. The performance here is normalized against a baseline that uses the entire SPM but does not employ any operator fusion.

The results show that a step size of 256 delivers the best performance. However, unlike the attention operation, restricting the SPM capacity for FFNs leads to a clear drop in performance.

My analysis showed that it’s possible to limit SPM usage for DDR-based operations with acceptable trade-offs, which is the key to enabling my MISO prefetching technique. The final dataflow works like this:

The strategy is to create a cycle of overlapping computation and memory transfers.

  • While the system is performing the Attention operation on data from DDR, it simultaneously prefetches the expert weights for the next MoE operation from HBM into the now-vacant part of the SPM.
  • Then, while the system is busy with the MoE operation, it uses the free SPM space to prefetch the KV cache for the subsequent layer from HBM.

Crucially, I only apply this SPM-limiting and prefetching strategy on the very last block of KV cache data and expert data that reside in DDR before a switch to an HBM operation.
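Putting it all together, the per-layer flow looks roughly like the structural sketch below. The "kernels" are print-only stubs, and the block counts and tier labels for QKV/Gating are made up; the point is where the SPM limit and the HBM prefetch kick in:

```python
# Structural sketch of the MISO dataflow for one decoder layer. Only the last
# DDR-resident block of each operation runs with the restricted SPM, and the
# HBM prefetch for the *next* operation is issued at that point.

def run(op, blocks, tier, spm_limit_mb=None, prefetch=None):
    """Print-only stand-in for issuing `blocks` tiles of an operation."""
    for i in range(blocks):
        last = (i == blocks - 1)
        if last and prefetch:
            print(f"  DMA: start prefetching {prefetch} from HBM into the freed SPM")
        limit = spm_limit_mb if last else None   # restrict SPM only on the last block
        print(f"{op}: block {i} from {tier}, SPM limit = {limit or 'none'}")

def decode_layer(ddr_kv=3, hbm_kv=2, ddr_experts=2, hbm_experts=2):
    run("QKV generation", 1, "DDR", spm_limit_mb=8)
    run("Attention", ddr_kv, "DDR", spm_limit_mb=8, prefetch="this layer's HBM experts")
    run("Attention", hbm_kv, "HBM")                  # operands already staged in SPM
    run("Gating", 1, "DDR", spm_limit_mb=8)
    run("MoE FFN", ddr_experts, "DDR", spm_limit_mb=8, prefetch="next layer's HBM KV cache")
    run("MoE FFN", hbm_experts, "HBM")

decode_layer()
```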
Of course, to make this work, the other operations in the pipeline also have to be efficient with their memory usage. The QKV generation and Gating operations must also run within a limited SPM footprint. Both of these are essentially matrix multiplication operations, and the chart below shows how their performance varies with different tile sizes.

The performance heatmap makes it clear that the optimal tile size for these operations is around j=256 and k=1024. Crucially, running with this tile configuration does not require a large amount of SPM. This confirms that the QKV generation and Gating operations can comfortably run within the small, restricted SPM partition I set aside for DDR-based tasks, completing the MISO pipeline.
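As a quick check, the SPM needed by a single GEMM tile at that shape is tiny. The token count and the FP32 accumulators are assumptions:

```python
# SPM footprint of one GEMM tile at the tile shape the heatmap favors
# (j=256 output columns, k=1024 reduction depth), for QKV generation / Gating.
# Token count and FP32 accumulators are assumptions; weights/activations are FP8.

TOKENS, J, K = 64, 256, 1024

a_tile = TOKENS * K          # FP8 activation tile
b_tile = K * J               # FP8 weight tile
c_tile = TOKENS * J * 4      # FP32 output accumulators

print(f"~{(a_tile + b_tile + c_tile) / 2**20:.2f} MiB per tile "
      f"(comfortably inside the 8 MB DDR partition, even double-buffered)")
```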

Evaluation Setup: Putting MISO to the Test

With the methodology fully developed, it was time to measure its performance. To do this, I set up a comparison between MISO and three other approaches:
  • Baseline (KV-major): This method uses a default tiling strategy from the ONNXim simulator called “gemmini gemm mapping”. This mapping is not well-suited for the narrow GEMM or GEMV operations that are common in LLM inference.
  • KV-major (Optimal Tiling): This is the KV-major data placement strategy, but it uses the high-performance tiling approach I identified in my earlier experiments.
  • Expert-major (Optimal Tiling): Similarly, this uses the Expert-major data placement with the same optimal tiling.
  • MISO: My proposed method, which combines the MISO data placement strategy with optimal tiling and data prefetching.

In addition to the Mixtral 8x7B model from the initial analysis, I also used the much larger Llama4-Scout to evaluate longer contexts.

Here are the key conditions for the evaluation:

  • All tests were run on the generation (decoding) stage of inference.
  • All four software methods were executed on my proposed HeMA-TPU architecture.
  • The Mixtral experiments used 1 HeMA-TPU, while the larger Llama4-Scout model required 4 HeMA-TPUs.
  • For the MoE gating mechanism, I assumed a balanced load, where tokens are distributed evenly across all experts.

Evaluation Results

Performance on Mixtral-8x7B

First, let’s look at the performance on the Mixtral-8x7B model. The bar chart below shows the throughput (in tokens per second) for the different methods. The x-axis represents the input configuration in a (Sequence Length – Batch Size) format.

The percentages shown above the MISO bars represent the performance uplift compared to whichever of the other two optimized methods (KV-major or Expert-major) performed better for that specific input.
As the chart clearly shows, my MISO technique consistently achieves the highest throughput. However, it’s important to be transparent that the largest jump in performance comes from applying optimizations like Flash Attention, which the naive baseline lacks.

You can also see a distinct trend: the performance advantage of MISO shrinks as the input sequence and batch size grow larger. This is expected. As the inputs get bigger, the total size of the KV cache increases, while the SPM capacity available for my prefetching technique remains fixed. This reduces the relative impact of the prefetch.

Performance on Llama4-Scout
Next, I ran the same comparison on the much larger Llama4-Scout model. Here are the results.


The Llama4 results show the exact same trend. For an input with an extremely large KV cache (a 128K sequence length with a batch size of 32), the performance gain from the MISO prefetch is a modest 2.2%.

A Deeper Dive: Latency Breakdown
To better understand exactly where these performance gains were coming from, I performed a latency breakdown for each individual operation. The graph below compares the naive, Expert-major, and MISO methods. For each operation, the latency is normalized to the fastest result among the three.


The breakdown reveals the trade-offs at the heart of the MISO strategy.
First, it’s obvious that both MISO and Expert-major, with their optimized tiling and use of Flash Attention, are dramatically faster than the naive approach across the board.
The more interesting comparison is between MISO and Expert-major. For operations that rely on data from the slow DDR memory (DDR attention and DDR MoE), MISO is slightly slower than Expert-major. This is the expected cost of restricting the SPM size for those specific operations. However, for the operations that use data from the fast HBM, the story is reversed. Because MISO has already prefetched the necessary data into the SPM, it achieves lower latency on HBM-based attention and MoE operations compared to Expert-major. This is the win that ultimately puts MISO ahead in overall throughput.

Limitations

While I’ve demonstrated the effectiveness of MISO-prefetch, my research has a couple of key limitations.

  • Diminishing Returns with Scale: The performance advantage of MISO shrinks as the input size increases. This is because the SPM, which is critical for prefetching, has a fixed size, and its impact becomes less significant as the total KV cache size grows.
  • Small Batch Inefficiency: My MISO data placement strategy, which reserves HBM space for experts in every layer, could be inefficient for very small batch sizes. If a batch is so small that some experts are never selected by the gating mechanism, the premium HBM space allocated to them is wasted, which could degrade overall system performance.