Memory Model
TiGrIS uses a three-region memory model designed for embedded devices with heterogeneous memory: small fast SRAM, large slow PSRAM, and read-only flash. The compiler plans all allocations at compile time. The runtime uses bump allocators with zero heap allocation during inference.
PSRAM is required for any model that compiles to more than one stage. Intermediate tensors spill from SRAM to PSRAM between stages because flash is read-only and cannot serve as spill storage. Models that fit in a single stage can run with SRAM only.
Memory regions
SRAM (fast arena)
The primary working memory for inference. All activation tensors during op execution live here.
- Bump allocator, pointer advances forward with alignment padding.
- Reset per stage or per tile iteration.
- Size set by the `-m` flag (e.g., `-m 256K`).
- An optional `fast_reserved` prefix at the start of the arena holds decompressed weight blocks and survives arena resets.
PSRAM (slow buffer)
External RAM for inter-stage tensor storage. Required for any multi-stage plan. Present on targets like ESP32-S3 (2-16 MB PSRAM depending on variant).
- Bump allocator, persistent across stages.
- Compacted after each stage: dead tensors are reclaimed and live tensors are shifted down.
- Stage inputs are loaded from here into the fast arena. Stage outputs are spilled from the fast arena to here.
- Model inputs are pre-allocated here by the caller before inference.
- Model outputs remain here after inference completes.
Flash (read-only, XIP)
Model weights are stored in flash and accessed via execute-in-place (XIP) memory mapping. With the --xip flag, weights are never copied to RAM. The runtime reads them directly from their flash addresses.
```
tigris compile model.onnx -m 256K -f 4M --xip -o model.tgrs
```
The `-f` flag validates that the compiled plan (weights + metadata) fits within the flash budget.
Arena allocation
The fast arena is a contiguous block of SRAM. Allocation works as follows:
Arena layout:

```
Low addr                                     High addr
┌───────────┬──────────┬──────────┬──────────────┐
│ reserved  │ Tensor A │ Tensor B │  free space  │
│ (weights) │ (aligned)│ (aligned)│              │
└───────────┴──────────┴──────────┴──────────────┘
^                                 ^
arena_base                        bump_ptr
```

- The bump pointer starts after `fast_reserved` and advances with each allocation.
- Each allocation is padded to `TIGRIS_TENSOR_ALIGN` bytes.
- On arena reset (between stages or tiles), the bump pointer rewinds to the end of `fast_reserved` (back to `arena_base` when no prefix is reserved), so the reserved weight blocks survive the reset.
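The allocation rules above can be sketched in C. This is a minimal illustration, not the actual TiGrIS implementation; the names `tg_arena_t`, `tg_arena_alloc`, and `tg_arena_reset` are hypothetical, and the alignment value is hard-coded to the ESP32-S3 default for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define TIGRIS_TENSOR_ALIGN 8  /* per-target value; 8 on ESP32-S3 */

typedef struct {
    uint8_t *base;     /* arena_base */
    size_t   size;     /* total arena size */
    size_t   reserved; /* fast_reserved prefix; survives resets */
    size_t   offset;   /* bump pointer, relative to base */
} tg_arena_t;

/* Round n up to the next multiple of TIGRIS_TENSOR_ALIGN. */
static size_t tg_align_up(size_t n) {
    return (n + TIGRIS_TENSOR_ALIGN - 1) & ~(size_t)(TIGRIS_TENSOR_ALIGN - 1);
}

static void tg_arena_init(tg_arena_t *a, uint8_t *mem, size_t size,
                          size_t reserved) {
    a->base = mem;
    a->size = size;
    a->reserved = tg_align_up(reserved);
    a->offset = a->reserved;  /* bump pointer starts after fast_reserved */
}

/* Bump-allocate with alignment padding; NULL means the arena is full
 * (the caller then compacts, and overflows to PSRAM if needed). */
static void *tg_arena_alloc(tg_arena_t *a, size_t n) {
    size_t padded = tg_align_up(n);
    if (a->offset + padded > a->size) return NULL;
    void *p = a->base + a->offset;
    a->offset += padded;
    return p;
}

/* Reset between stages/tiles: rewind past the reserved prefix only. */
static void tg_arena_reset(tg_arena_t *a) {
    a->offset = a->reserved;
}
```

Because allocation is a pointer bump and reset is a single store, per-op allocation cost is constant regardless of tensor count.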
OOM handling
If an allocation exceeds remaining fast arena space:
- Compact: Scan for tensors already freed (ref count reached zero). Reclaim their space by shifting subsequent live tensors down. Retry the allocation.
- Overflow: If compaction is insufficient, allocate from the slow pool (PSRAM). This is a fallback that works but incurs latency from slow memory access.
This two-step fallback means inference never fails due to memory pressure. It degrades gracefully at the cost of speed.
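The compact-then-overflow fallback might look like the following sketch. All names (`tg_fast_t`, `tg_compact`, the slot table, the static slow pool) are illustrative, and alignment padding is omitted for brevity; the real runtime's bookkeeping will differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_TENSORS 16

typedef struct {
    size_t off, len;  /* placement within the fast arena */
    int    freed;     /* ref count reached zero */
} tg_slot_t;

typedef struct {
    uint8_t  *base;
    size_t    size, bump;
    tg_slot_t slots[MAX_TENSORS];
    int       nslots;
} tg_fast_t;

/* Reclaim freed tensors by sliding live ones toward low addresses. */
static void tg_compact(tg_fast_t *f) {
    size_t dst = 0;
    int n = 0;
    for (int i = 0; i < f->nslots; i++) {
        tg_slot_t *s = &f->slots[i];
        if (s->freed) continue;  /* drop dead tensor, reclaim its space */
        memmove(f->base + dst, f->base + s->off, s->len);
        s->off = dst;
        dst += s->len;
        f->slots[n++] = *s;
    }
    f->nslots = n;
    f->bump = dst;
}

/* Stand-in for the PSRAM slow pool. */
static uint8_t slow_pool[1024];
static size_t  slow_bump;

/* Step 1: compact and retry. Step 2: overflow to the slow pool. */
static void *tg_alloc(tg_fast_t *f, size_t len) {
    if (f->bump + len > f->size) tg_compact(f);
    if (f->bump + len <= f->size && f->nslots < MAX_TENSORS) {
        tg_slot_t *s = &f->slots[f->nslots++];
        s->off = f->bump; s->len = len; s->freed = 0;
        f->bump += len;
        return f->base + s->off;
    }
    void *p = slow_pool + slow_bump;  /* slower access, but never fails */
    slow_bump += len;
    return p;
}
```

Note that compaction moves live tensor data, so kernel code must re-read tensor addresses from the slot table rather than caching raw pointers across allocations.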
Tensor lifetime management
Intra-stage (within a stage)
Each activation tensor has a reference count equal to the number of ops that consume it. After each op executes, the runtime decrements the ref count of each input tensor. When the count reaches zero, the tensor is freed (eligible for reclamation on the next compact).
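The per-op bookkeeping reduces to a decrement and a flag. A sketch with illustrative field names:

```c
typedef struct {
    int refcount;  /* number of not-yet-executed consumer ops */
    int freed;     /* eligible for reclamation by the next compact */
} tg_tensor_t;

/* Called once per input tensor after an op finishes executing. */
static void tg_release_input(tg_tensor_t *t) {
    if (--t->refcount == 0)
        t->freed = 1;  /* space is reclaimed lazily, at the next compaction */
}
```

Because the consumer count is known at compile time from the op graph, no dynamic tracking is needed: the plan simply records each tensor's initial refcount.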
Inter-stage (between stages)
- Stage outputs: After a stage completes, tensors consumed by later stages are spilled (copied) to the slow buffer.
- Stage inputs: Before a stage executes, tensors produced by earlier stages are loaded from the slow buffer into the fast arena.
- The fast arena is reset between stages, so all intra-stage temporaries are reclaimed automatically.
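Putting the inter-stage rules together, the executor's outer loop can be simulated in miniature. Everything here is hypothetical (`run_stage`, the buffers, the +1 "compute" standing in for real kernels); spill and load are plain copies between the two regions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { FAST_SIZE = 64, SLOW_SIZE = 256 };

static uint8_t fast_arena[FAST_SIZE]; /* SRAM working memory */
static uint8_t slow_buf[SLOW_SIZE];   /* PSRAM inter-stage storage */

/* One stage: load input from PSRAM, compute in SRAM, spill the output.
 * Returns the new slow-buffer bump offset. */
static size_t run_stage(size_t in_off, size_t in_len, size_t slow_bump) {
    uint8_t *in = fast_arena;                   /* load: slow -> fast */
    memcpy(in, slow_buf + in_off, in_len);

    uint8_t *out = fast_arena + in_len;         /* compute in fast arena */
    for (size_t i = 0; i < in_len; i++)
        out[i] = (uint8_t)(in[i] + 1);          /* stand-in for real ops */

    memcpy(slow_buf + slow_bump, out, in_len);  /* spill: fast -> slow */
    /* fast arena is reset here; intra-stage temporaries vanish */
    return slow_bump + in_len;
}
```

Chaining two stages shows the data path: the caller places the model input in the slow buffer, each stage reads its predecessor's spilled output, and the final result is read from the slow buffer.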
Special tensors
- Model inputs: Allocated in the slow buffer by the caller before invoking the executor. The first stage loads them.
- Model outputs: Left in the slow buffer after the last stage. The caller reads them directly.
- Constants/weights: Reside in flash. Not allocated in either arena. Accessed by pointer.
Alignment
Tensor addresses are aligned to TIGRIS_TENSOR_ALIGN bytes. The value is auto-detected at compile time based on the target architecture.
| Architecture | TIGRIS_TENSOR_ALIGN | Notes |
|---|---|---|
| Xtensa (ESP32-S3) | 8 bytes | Matches ee.vld.l.64.ip load alignment |
| AArch64 (Cortex-A, RPi) | 16 bytes | NEON 128-bit vector alignment |
| x86_64 | 32 bytes | AVX2 256-bit vector alignment |
| ARM32 (Cortex-M) | 4 bytes | Minimum word alignment |
Override with -DTIGRIS_TENSOR_ALIGN=N at build time. Backend-specific scratch buffers (e.g., ESP-NN SIMD) may require stricter alignment and are handled separately by the backend adapter.
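Auto-detection of this kind is typically done with predefined compiler macros. The following is a sketch of the pattern, not the actual TiGrIS header; the macro tests are common conventions (`__AVX2__`, `__aarch64__`, `__XTENSA__`) rather than confirmed details of the build.

```c
/* Pick a per-architecture default at build time; a user-supplied
 * -DTIGRIS_TENSOR_ALIGN=N wins because of the #ifndef guard. */
#ifndef TIGRIS_TENSOR_ALIGN
#  if defined(__AVX2__)
#    define TIGRIS_TENSOR_ALIGN 32  /* x86_64: AVX2 256-bit vectors */
#  elif defined(__aarch64__)
#    define TIGRIS_TENSOR_ALIGN 16  /* AArch64: NEON 128-bit vectors */
#  elif defined(__XTENSA__) || defined(__xtensa__)
#    define TIGRIS_TENSOR_ALIGN 8   /* Xtensa: ee.vld.l.64.ip loads */
#  else
#    define TIGRIS_TENSOR_ALIGN 4   /* ARM32 and others: word alignment */
#  endif
#endif

/* Mask-based rounding in the allocator requires a power of two. */
_Static_assert((TIGRIS_TENSOR_ALIGN & (TIGRIS_TENSOR_ALIGN - 1)) == 0,
               "TIGRIS_TENSOR_ALIGN must be a power of two");
```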
Memory budget sizing
The -m flag sets the SRAM budget available for inference. Choosing the right budget involves a tradeoff:
| Budget | Stages | Spill traffic | Tiling overhead | Inference speed |
|---|---|---|---|---|
| Large | Fewer | Less | Less | Faster |
| Small | More | More | More | Slower, but frees SRAM |
A larger budget means fewer temporal partitions, less data movement between SRAM and PSRAM, and fewer tiled iterations. A smaller budget frees SRAM for other uses like DMA buffers, RTOS stacks, or backend scratch buffers.
Backend scratch buffers
Some kernel backends (ESP-NN, CMSIS-NN) need scratch memory for internal computation. This scratch is carved from the top of the fast arena by a prepare() call before inference starts. The effective arena size for activations is reduced accordingly.
Arena with backend scratch:

```
Low addr                                         High addr
┌───────────┬──────────────────┬──────────────────┐
│ reserved  │ activation space │ backend scratch  │
│ (weights) │ (bump allocator) │  (SIMD scratch)  │
└───────────┴──────────────────┴──────────────────┘
```

The sweet spot depends on the model and backend. Example from a large int8 model on ESP32-S3:
| Configuration | Arena | Scratch | Inference latency |
|---|---|---|---|
| Minimal scratch | 232 KB | 23 KB | ~40 s |
| Balanced | 64 KB | 180 KB | ~11 s |
The balanced configuration is roughly 4x faster because ESP-NN’s SIMD kernels use the scratch space effectively, and the additional tiling overhead from a smaller arena is more than offset by faster kernel execution.
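The prepare()-time carving can be sketched as shrinking the size visible to the bump allocator, so activations and scratch can never collide. Names here are hypothetical, not the actual TiGrIS API.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;
    size_t   size;  /* effective size seen by the bump allocator */
    size_t   bump;
} tg_arena_t;

/* Reserve scratch_len bytes at the top of the arena for the backend.
 * Returns the scratch pointer, or NULL if it would collide with
 * activations already allocated. */
static void *tg_carve_scratch(tg_arena_t *a, size_t scratch_len) {
    if (scratch_len > a->size - a->bump) return NULL;
    a->size -= scratch_len;   /* activation space ends below the scratch */
    return a->base + a->size; /* scratch occupies [size, old size) */
}
```

After carving, `tg_arena_alloc`-style bump allocation proceeds unchanged against the reduced `size`, which is exactly the "effective arena size for activations is reduced accordingly" behavior described above.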
Sizing guideline
- Run `tigris analyze model.onnx -m <budget>` with different budgets to see the stage count and tiling plan.
- Reserve enough SRAM for the backend’s scratch requirements.
- Start with the smallest budget that keeps the stage count reasonable (under ~20 stages for most models), then increase if latency is too high.