Peak working memory, not model size, decides whether inference fits in RAM. It is a property of the schedule and the allocation rather than the model alone, which makes it a compile-time planning problem.
Picture a device that has been in the field for years: a sensor node, a metering unit, a small controller. It was designed long before machine learning was part of its requirements, and it cannot be redesigned now. The fleet is deployed, the hardware is fixed, and a board revision is out of the question. What can change is the firmware. So a model retrofitted onto it has to live in whatever memory the existing firmware leaves over. Suppose the model is ready: accurate, already quantized to int8, and small on disk. Deployment still fails. The runtime cannot allocate its working buffers, and the standard advice is to shrink the network until it fits. This post is about why that happens, and why shrinking the model is like sawing the legs off a piano to get it through the door: it fits now, but it is not really the instrument you meant to deliver.
The reason is that model size and memory demand are two different quantities. The size of a model is the size of its parameters. Whether inference runs on a device like this depends on peak working memory: the largest amount of intermediate data alive at any single point during execution. And that peak depends not only on the model, but also on the order its operators run in and on how memory is allocated. Both of these are decisions, not givens, so whether a model fits is a planning problem, solvable ahead of time on a workstation.
1. Model size is the wrong number
A quantized network might occupy a few hundred kilobytes of parameters. On a resource-constrained device this is still not the number that matters, because parameters are immutable. They are written once at build time and only read during inference, so they can stay in flash and be fetched, or executed in place, without taking up RAM.
What has to sit in RAM is what the network writes while it computes: the intermediate tensors between operators, plus the scratch memory some kernels need. These buffers appear and disappear during a single forward pass. What the budget has to hold is the largest set of them that is alive at the same time, not their sum, and not the parameter count. A network with few parameters can have a large activation peak, and a network with many parameters can have a small one. Most deployment failures are the first case: the weights sit comfortably in flash, and the activations do not fit in RAM.
2. The memory classes
It helps to separate the memory an inference workload consumes into classes, because they differ in size, in mutability, and in where they can live.
| Class | Mutable? | Natural home | Reducible by planning? |
|---|---|---|---|
| Weights and constants | No | Flash (read / XIP) | No (read-only; not the binding constraint when flash-resident) |
| Input / output tensors | Yes | RAM | Marginally (must persist across the pass) |
| Intermediate activations | Yes | RAM | Yes, substantially |
| Operator scratch | Yes | RAM (must be fast) | Indirectly (kernel-dependent) |
| Runtime / allocator overhead | n/a | RAM | Yes (eliminable with static plans) |
Most of this post is about intermediate activations: the largest mutable class, and the one planning can do the most about, because their lifetimes are short and overlapping. Operator scratch is different. It depends on which kernel implementation sits behind each operator, not on the graph, so it cannot be planned at this level; the working assumption from here on is that the budget leaves room for the scratch of the kernels you deploy with.
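To make the accounting concrete, here is a minimal sketch of the budget check the table implies. The class name `MemoryProfile`, its fields, and every byte count are hypothetical; summing the RAM-resident classes is a conservative assumption, since a real planner may let them share space.

```python
from dataclasses import dataclass

@dataclass
class MemoryProfile:
    """Byte counts per memory class. All figures used below are made up."""
    weights: int            # read-only, assumed flash-resident
    io_tensors: int         # model input and output buffers
    activation_peak: int    # largest set of simultaneously live intermediates
    scratch_peak: int       # largest scratch requirement of any kernel
    runtime_overhead: int   # allocator bookkeeping; 0 for a static plan

    def ram_required(self) -> int:
        # Weights are deliberately excluded: they stay in flash.
        return (self.io_tensors + self.activation_peak
                + self.scratch_peak + self.runtime_overhead)

    def fits(self, ram_budget: int) -> bool:
        return self.ram_required() <= ram_budget

KIB = 1024
profile = MemoryProfile(weights=400 * KIB, io_tensors=16 * KIB,
                        activation_peak=96 * KIB, scratch_peak=8 * KIB,
                        runtime_overhead=0)
print(profile.ram_required() // KIB, "KiB of RAM needed")
print("fits in 128 KiB:", profile.fits(128 * KIB))
```

Note that the 400 KiB of weights never enters the RAM figure; only the mutable classes do.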
3. Peak memory is a property of the schedule
Model the network as a directed acyclic graph of operators producing tensors. Each intermediate tensor has a lifetime: it is born when its producer runs, and it is dead after its last consumer has run. At any step of an execution order, the live tensors are the ones born and not yet dead. Their total size is the working set, and the maximum over the schedule is what the arena must hold.
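The definitions above translate almost directly into code. The following is a sketch, not any particular runtime's implementation: the function name `peak_working_set`, the graph encoding, and the tensor sizes (in KiB) are assumptions chosen for illustration. The graph is a small residual block in which tensor `a` feeds both `conv1` and the final `add`.

```python
def peak_working_set(schedule, ops, sizes):
    """Return (peak, per_step) for one execution order.

    schedule: operator names in execution order
    ops:      {op: (input_tensors, output_tensors)}
    sizes:    {tensor: size}
    """
    # A tensor is dead after the last step that reads it.
    last_use = {}
    for step, op in enumerate(schedule):
        for t in ops[op][0]:
            last_use[t] = step

    live, per_step = set(), []
    for step, op in enumerate(schedule):
        live.update(ops[op][1])              # born when the producer runs
        per_step.append(sum(sizes[t] for t in live))
        # Drop tensors whose last consumer was this step (or that have none).
        live = {t for t in live if last_use.get(t, step) > step}
    return max(per_step), per_step

# Hypothetical residual block; sizes in KiB.
ops = {
    "stem":  ((), ("a",)),
    "conv1": (("a",), ("b",)),
    "conv2": (("b",), ("c",)),
    "add":   (("a", "c"), ("d",)),
}
sizes = {"a": 64, "b": 128, "c": 64, "d": 64}

peak, per_step = peak_working_set(["stem", "conv1", "conv2", "add"], ops, sizes)
print(per_step)                    # [64, 192, 256, 192]
print(peak, sum(sizes.values()))   # 256 320
```

The peak (256) is smaller than the sum of all tensors (320) because `b` is dead before `d` is born, and it is larger than any single tensor because `a` has to stay alive across the whole branch.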
This peak is not fixed by the graph. It follows from two decisions:
- The schedule, the order the operators run in, which decides what is live at the same time. An order that interleaves independent branches typically keeps more tensors live at once than one that finishes a branch before starting the next.
- The allocation, the byte offsets assigned in the arena. Once a tensor is dead its bytes can be reused, so a good offset assignment packs tensors with disjoint lifetimes into the same space.
Graphs with branches are where this matters most. If a tensor feeds both the next operator and a residual connection further downstream, it stays live across the whole branch in between and holds its space, no matter how cheap it was to produce. Minimizing the arena under fixed lifetimes is a packing problem, similar to register allocation, and NP-hard in general; in practice simple greedy heuristics come close to the optimum. The important point is the framing: peak memory is a quantity you can minimize by choosing well, not a constant you measure and accept.
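One common greedy heuristic is "largest tensor first, lowest free offset". The sketch below assumes lifetimes are already known as inclusive step ranges; the function name `assign_offsets` and the four-tensor residual block (sizes in KiB) are hypothetical, and this is one heuristic among several, not a claim about what any specific compiler does.

```python
def assign_offsets(tensors):
    """tensors: {name: (size, first_step, last_step)}, lifetimes inclusive.
    Returns ({name: offset}, arena_size)."""
    placed = {}  # name -> (offset, size, first, last)
    # Place big tensors first; small ones fill the gaps they leave.
    for name, (size, first, last) in sorted(
            tensors.items(), key=lambda kv: -kv[1][0]):
        # Only tensors that are alive at the same time can collide.
        busy = sorted((off, off + sz) for off, sz, f, l in placed.values()
                      if f <= last and first <= l)
        offset = 0
        for start, end in busy:
            if offset + size <= start:
                break                    # the gap before this block fits
            offset = max(offset, end)
        placed[name] = (offset, size, first, last)
    offsets = {n: p[0] for n, p in placed.items()}
    arena = max(p[0] + p[1] for p in placed.values())
    return offsets, arena

# Hypothetical residual block: 'a' is held across the branch (steps 0..3).
tensors = {
    "a": (64, 0, 3),
    "b": (128, 1, 2),
    "c": (64, 2, 3),
    "d": (64, 3, 3),
}
offsets, arena = assign_offsets(tensors)
print(offsets, arena)
```

Here `d` lands at offset 0, on top of the bytes `b` used until step 2, so the arena is 256 instead of the 320 that one-buffer-per-tensor would need.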
4. Why naive execution wastes RAM
A runtime that allocates tensors as it encounters them, or sizes a flat arena from the natural execution order with conservative lifetimes, wastes memory in two ways. It keeps buffers alive when it cannot prove they are dead, so they stay around past their last use. And it follows an execution order that was chosen for correctness, not for minimal concurrency, so more tensors are live at the same time than necessary. The result is a model that should fit but does not. The usual response is to shrink the network until allocation succeeds, which pays with model quality for what is really a planning problem.
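The effect of execution order is easy to reproduce on a toy graph. The sketch below brute-forces every valid order of a hypothetical two-branch graph (sizes in KiB) and compares the best one with an interleaved order. Exhaustive search is only viable for tiny graphs and is used here purely for illustration; the helper names are assumptions.

```python
from itertools import permutations

def peak(schedule, ops, sizes):
    """Largest total size of simultaneously live tensors for one order."""
    last_use = {t: i for i, op in enumerate(schedule) for t in ops[op][0]}
    live, worst = set(), 0
    for i, op in enumerate(schedule):
        live |= set(ops[op][1])
        worst = max(worst, sum(sizes[t] for t in live))
        live = {t for t in live if last_use.get(t, i) > i}
    return worst

def is_valid(schedule, ops):
    """True if every operator runs after all of its producers."""
    produced = set()
    for op in schedule:
        if not set(ops[op][0]) <= produced:
            return False
        produced |= set(ops[op][1])
    return True

def best_schedule(ops, sizes):
    valid = (s for s in permutations(ops) if is_valid(s, ops))
    return min(valid, key=lambda s: peak(s, ops, sizes))

# Two independent branches off a shared stem, joined by a concat.
ops = {
    "stem":   ((), ("x",)),
    "a1":     (("x",), ("p",)),
    "a2":     (("p",), ("q",)),
    "b1":     (("x",), ("r",)),
    "b2":     (("r",), ("s",)),
    "concat": (("q", "s"), ("out",)),
}
sizes = {"x": 32, "p": 128, "q": 16, "r": 128, "s": 16, "out": 32}

interleaved = ("stem", "a1", "b1", "a2", "b2", "concat")
best = best_schedule(ops, sizes)
print(peak(interleaved, ops, sizes), peak(best, ops, sizes))  # 288 176
```

Both orders are correct and produce the same result. The interleaved one holds both 128 KiB intermediates at once; the branch-first one never does.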
5. Static execution plans
If the schedule and the allocation determine peak memory, both should be decided where compute is cheap: ahead of time, on the workstation. A compiler can read the whole graph, derive exact lifetimes, choose an order, and give every tensor a fixed byte offset in a memory pool, whether on-chip SRAM, external RAM, or flash. The output is a static plan: an ordered list of operator calls over preassigned buffers. The device executes it by walking through the list.
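As an illustration of what "an ordered list of operator calls over preassigned buffers" can look like, here is a toy executor. It is a sketch in Python for readability, not the plan format of any real tool: the kernels `double` and `add_one`, the `PLAN` layout, and the 8-byte arena are all invented for this example.

```python
ARENA_SIZE = 8                      # fixed ahead of time by the planner
arena = bytearray(ARENA_SIZE)       # the only buffer the executor touches

def view(offset, size):
    """Writable window into the arena; no allocation, no copy."""
    return memoryview(arena)[offset:offset + size]

def double(src, dst):
    for i in range(len(src)):
        dst[i] = (src[i] * 2) & 0xFF

def add_one(src, dst):
    for i in range(len(src)):
        dst[i] = (src[i] + 1) & 0xFF

# Each step: (kernel, (in_offset, in_size), (out_offset, out_size)).
PLAN = [
    (double,  (0, 4), (4, 4)),   # t1 = 2 * input
    (add_one, (4, 4), (0, 4)),   # t2 = t1 + 1, reusing the dead input's bytes
    (double,  (0, 4), (4, 4)),   # out = 2 * t2, reusing t1's bytes
]

def run(plan, input_bytes):
    arena[0:len(input_bytes)] = input_bytes     # input sits at offset 0
    for kernel, src, dst in plan:               # the whole "runtime"
        kernel(view(*src), view(*dst))
    out_offset, out_size = plan[-1][2]
    return bytes(view(out_offset, out_size))

print(list(run(PLAN, bytes([1, 2, 3, 4]))))     # [6, 10, 14, 18]
```

Four 4-byte tensors pass through an 8-byte arena because the offsets were chosen so that each output overwrites a tensor that is already dead. Nothing in `run` decides anything; it only walks the list.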
This removes the on-device allocator entirely. No fragmentation, no allocation bookkeeping, no interpreter dispatch; the runtime is a loop over a fixed structure. It also makes on-device memory deterministic: a graph that exceeds the budget fails at compile time on the workstation, not at runtime on the device. These properties matter beyond convenience. Functional-safety standards expect exactly this from embedded software: MISRA C:2012 rules out dynamic memory allocation (Directive 4.12, Rule 21.3), and statically bounded, deterministic memory behavior is a precondition for arguing resource safety under standards like ISO 26262 and IEC 61508. A runtime that walks a fixed plan over a preassigned arena is small, and it is also auditable.
The arena size is known before any firmware is flashed. For MobileNetV1 (α = 1.0, 128×128 input, int8), condensed:
$ tigris analyze mobilenet_v1_i8.onnx -m 256K -f 16M
Operators 30
Tensors 114 (31 activations)
Peak memory (naive) 256.00 KiB
Largest tensor 1x64x64x64 (256.00 KiB)
SRAM budget 256.00 KiB, 1 stage PASS - fits in budget
Flash weights 3.09 MiB, plan 3.10 MiB PASS - plan fits
These numbers show the point from section 1: 3.09 MiB of weights sit in flash and never touch the budget, while fit is decided by a 256 KiB activation peak, dominated by a single 64×64×64 tensor early in the network. The reported peak is an activation figure; kernel scratch is reserved per backend at startup from the same arena, so the assumption from section 2 holds here too: a budget right at the edge of a PASS needs headroom for the kernels you deploy with. With that, the question of fit moves from trial and error on hardware to something you settle statically.
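The 256 KiB figure can be checked by hand from the reported shape. A minimal sketch, assuming one byte per element for int8 (the helper `tensor_bytes` is hypothetical):

```python
from math import prod

def tensor_bytes(shape, bytes_per_element=1):
    """Size of a dense tensor; int8 is assumed to take 1 byte per element."""
    return prod(shape) * bytes_per_element

largest = tensor_bytes((1, 64, 64, 64))
print(largest, "bytes =", largest // 1024, "KiB")   # 262144 bytes = 256 KiB

# The flash-resident weights are roughly 12x larger than that tensor,
# yet they play no part in the SRAM verdict.
weights = 3.09 * 1024 * 1024
print(round(weights / largest, 1))
```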
6. When planning is not enough: tiling
Scheduling and allocation rearrange and reuse memory, but they cannot make a single tensor smaller. If one operator’s output, or one layer’s working set, exceeds the budget on its own, no offset assignment will fit it. The remaining lever is the shape of the computation.
Tiling changes that shape. Large intermediates are produced and consumed in pieces instead of all at once, which trades extra execution steps for a smaller peak working set. The weights, the arithmetic, and the numerical result stay identical; only the granularity and order of the computation change. This is the difference between a different execution shape of the same model and a different, smaller model. The first keeps the network you trained; the second does not.
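A one-dimensional toy shows the trade. The sketch below runs a 3-tap filter followed by a ReLU, once with the full intermediate materialized and once in strips. The kernels, the input data, and the tile size are made up; integer arithmetic is used so that "identical" can be checked exactly.

```python
def conv3(x):
    """3-tap 'valid' filter: out[i] = x[i] - 2*x[i+1] + x[i+2]."""
    return [x[i] - 2 * x[i + 1] + x[i + 2] for i in range(len(x) - 2)]

def relu(x):
    return [max(v, 0) for v in x]

def run_full(x):
    mid = conv3(x)                        # whole intermediate exists at once
    return relu(mid), len(mid)

def run_tiled(x, tile):
    out, peak_mid = [], 0
    n_out = len(x) - 2
    for start in range(0, n_out, tile):
        stop = min(start + tile, n_out)
        # A strip of outputs needs its inputs plus a 2-element halo.
        mid = conv3(x[start:stop + 2])    # only one strip of the intermediate
        peak_mid = max(peak_mid, len(mid))
        out.extend(relu(mid))
    return out, peak_mid

x = [(i * 7) % 11 - 5 for i in range(34)]
full_out, full_peak = run_full(x)
tiled_out, tiled_peak = run_tiled(x, tile=8)
print(full_out == tiled_out, full_peak, tiled_peak)   # True 32 8
```

The intermediate between the two operators shrinks from 32 elements to 8, at the cost of four passes instead of one and a small amount of re-read input at the strip borders. The output is the same list.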
TiGrIS treats this as multi-strategy scheduling. It splits the graph into stages and picks, per stage, the strategy that meets the budget: run an operator directly when it already fits, fuse a sequence of operators into a streaming pipeline so their intermediates never fully materialize, or process a feature map in spatial strips, spilling to slower memory, when a single operator is too large on its own. The budget is a compile-time parameter; a tighter budget yields more and finer stages. The result is a single knob that trades inference latency against memory footprint, while the model at both ends of the curve stays the one you trained.
The same MobileNetV1 from section 5, compiled at descending budgets and measured on an ESP32-S3 (ESP-NN int8 kernels, mean of 10 runs):
| SRAM budget | Strategy chosen | Latency |
|---|---|---|
| 256 KB | flat, single stage | 1214 ms |
| 128 KB | chain-tiled stages | 1265 ms |
| 64 KB | 8 of 9 stages tiled | 1515 ms |
| 32 KB | 12 of 13 stages tiled | 2015 ms |
At 32 KB the scheduled peak is an eighth of the naive one, at 1.66 times the flat latency. The weights and the outputs are identical in every row. For comparison, the same int8 model fails to allocate under TFLite Micro with a 256 KB arena (ARENA_TOO_SMALL); the full harness and raw logs are in tigris-bench.
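The trade-off curve can be read straight off the table. The snippet below only restates the measured numbers as ratios against the flat 256 KB run; the names `LATENCY_MS` and `tradeoff` are hypothetical.

```python
# Measured latencies from the table: SRAM budget (KB) -> latency (ms).
LATENCY_MS = {256: 1214, 128: 1265, 64: 1515, 32: 2015}

def tradeoff(budget_kb, baseline_kb=256):
    """Return (memory_fraction, latency_factor) relative to the flat run."""
    return (budget_kb / baseline_kb,
            LATENCY_MS[budget_kb] / LATENCY_MS[baseline_kb])

for budget in LATENCY_MS:
    mem, lat = tradeoff(budget)
    print(f"{budget:>3} KB: {mem:.3f} of the flat budget, {lat:.2f}x latency")
```

Halving the budget costs about 4% in latency, and the curve steepens as more stages have to be tiled.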
This is what makes planning interesting for the retrofit case from the beginning. On deployed hardware the budget is not negotiable. A compile-time knob that adapts the execution shape to that budget, instead of the model to the device, is the difference between shipping a firmware update and redesigning the product.
7. Summary
The memory wall in embedded inference is usually treated as a model-size problem and answered with compression. A large part of it is a planning problem. Peak working memory is determined by the execution schedule and the allocation, and both are free decisions best made at compile time. Static lifetime analysis with offset and pool assignment recovers the memory that naive runtime allocation wastes, and tiling extends the reachable budget by changing the execution shape instead of the model. TiGrIS is built around this: an ahead-of-time compiler doing lifetime analysis, activation reuse, and multi-strategy tiling, emitting a binary plan that a small heap-free C99 runtime executes. Fit is settled statically, and in favor of the model you actually trained.
TiGrIS is developed at RAWS Labs, an applied research and engineering lab. If you are working against a memory budget like the one above, we are interested in hearing about it.