Memory Model
TiGrIS uses a three-region memory model designed for embedded devices with heterogeneous memory: small fast SRAM, large slow PSRAM, and read-only flash. The compiler plans all allocations at compile time. The runtime uses bump allocators with zero heap allocation during inference.
PSRAM is required for any model that compiles to more than one stage. Intermediate tensors spill from SRAM to PSRAM between stages because flash is read-only and cannot serve as spill storage. Models that fit in a single stage can run with SRAM only.
Memory regions
SRAM (fast arena)
The primary working memory for inference. All activation tensors during op execution live here.
- Bump allocator, pointer advances forward with alignment padding.
- Reset per stage or per tile iteration.
- Size set by the `-m` flag (e.g., `-m 256K`).
- An optional `fast_reserved` prefix at the start of the arena holds decompressed weight blocks and survives arena resets.
PSRAM (slow buffer)
External RAM for inter-stage tensor storage. Required for any multi-stage plan. Present on targets like ESP32-S3 (2-16 MB PSRAM depending on variant).
- Bump allocator, persistent across stages.
- Compacted after each stage: dead tensors are reclaimed and live tensors are shifted down.
- Stage inputs are loaded from here into the fast arena. Stage outputs are spilled from the fast arena to here.
- Model inputs are pre-allocated here by the caller before inference.
- Model outputs remain here after inference completes.
Flash (read-only, XIP)
Model weights are stored in flash and accessed via execute-in-place (XIP) memory mapping. With the --xip flag, weights are never copied to RAM. The runtime reads them directly from their flash addresses.
```
tigris compile model.onnx -m 256K -f 4M --xip -o model.tgrs
```
The `-f` flag validates that the compiled plan (weights + metadata) fits within the flash budget.
Arena allocation
The fast arena is a contiguous block of SRAM. Allocation works as follows:
Arena layout:

```
Low addr                                     High addr
┌───────────┬──────────┬──────────┬──────────────┐
│ reserved  │ Tensor A │ Tensor B │  free space  │
│ (weights) │ (aligned)│ (aligned)│              │
└───────────┴──────────┴──────────┴──────────────┘
^                                 ^
arena_base                        bump_ptr
```

- The bump pointer starts after `fast_reserved` and advances with each allocation.
- Each allocation is padded to `TIGRIS_TENSOR_ALIGN` bytes.
- On arena reset (between stages or tiles), the bump pointer rewinds to the end of `fast_reserved` (back to `arena_base` when no prefix is reserved), so the reserved weight blocks survive the reset.
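The allocation rules above can be sketched in C. This is a minimal illustration, not the actual TiGrIS implementation; the names `tg_arena_t`, `tg_arena_alloc`, and `tg_arena_reset` are hypothetical, and the alignment value is hard-coded to the ESP32-S3 default for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define TIGRIS_TENSOR_ALIGN 8  /* per-target value; 8 on ESP32-S3 */

typedef struct {
    uint8_t *base;     /* arena_base */
    size_t   size;     /* total arena size */
    size_t   reserved; /* fast_reserved prefix; survives resets */
    size_t   offset;   /* bump pointer, relative to base */
} tg_arena_t;

/* Round n up to the next multiple of TIGRIS_TENSOR_ALIGN. */
static size_t tg_align_up(size_t n) {
    return (n + TIGRIS_TENSOR_ALIGN - 1) & ~(size_t)(TIGRIS_TENSOR_ALIGN - 1);
}

static void tg_arena_init(tg_arena_t *a, uint8_t *mem, size_t size,
                          size_t reserved) {
    a->base = mem;
    a->size = size;
    a->reserved = tg_align_up(reserved);
    a->offset = a->reserved;  /* bump pointer starts after fast_reserved */
}

/* Bump-allocate with alignment padding; NULL means the arena is full
 * (the caller then compacts, and overflows to PSRAM if needed). */
static void *tg_arena_alloc(tg_arena_t *a, size_t n) {
    size_t padded = tg_align_up(n);
    if (a->offset + padded > a->size) return NULL;
    void *p = a->base + a->offset;
    a->offset += padded;
    return p;
}

/* Reset between stages/tiles: rewind past the reserved prefix only. */
static void tg_arena_reset(tg_arena_t *a) {
    a->offset = a->reserved;
}
```

Because allocation is a pointer bump and reset is a single store, per-op allocation cost is constant regardless of tensor count.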
OOM handling
If an allocation exceeds remaining fast arena space:
- Compact: Scan for tensors already freed (ref count reached zero). Reclaim their space by shifting subsequent live tensors down. Retry the allocation.
- Overflow: If compaction is insufficient, allocate from the slow pool (PSRAM). This is a fallback that works but incurs latency from slow memory access.
This two-step fallback means inference never fails due to memory pressure. It degrades gracefully at the cost of speed.
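The compact-then-overflow fallback might look like the following sketch. All names (`tg_fast_t`, `tg_compact`, the slot table, the static slow pool) are illustrative, and alignment padding is omitted for brevity; the real runtime's bookkeeping will differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_TENSORS 16

typedef struct {
    size_t off, len;  /* placement within the fast arena */
    int    freed;     /* ref count reached zero */
} tg_slot_t;

typedef struct {
    uint8_t  *base;
    size_t    size, bump;
    tg_slot_t slots[MAX_TENSORS];
    int       nslots;
} tg_fast_t;

/* Reclaim freed tensors by sliding live ones toward low addresses. */
static void tg_compact(tg_fast_t *f) {
    size_t dst = 0;
    int n = 0;
    for (int i = 0; i < f->nslots; i++) {
        tg_slot_t *s = &f->slots[i];
        if (s->freed) continue;  /* drop dead tensor, reclaim its space */
        memmove(f->base + dst, f->base + s->off, s->len);
        s->off = dst;
        dst += s->len;
        f->slots[n++] = *s;
    }
    f->nslots = n;
    f->bump = dst;
}

/* Stand-in for the PSRAM slow pool. */
static uint8_t slow_pool[1024];
static size_t  slow_bump;

/* Step 1: compact and retry. Step 2: overflow to the slow pool. */
static void *tg_alloc(tg_fast_t *f, size_t len) {
    if (f->bump + len > f->size) tg_compact(f);
    if (f->bump + len <= f->size && f->nslots < MAX_TENSORS) {
        tg_slot_t *s = &f->slots[f->nslots++];
        s->off = f->bump; s->len = len; s->freed = 0;
        f->bump += len;
        return f->base + s->off;
    }
    void *p = slow_pool + slow_bump;  /* slower access, but never fails */
    slow_bump += len;
    return p;
}
```

Note that compaction moves live tensor data, so kernel code must re-read tensor addresses from the slot table rather than caching raw pointers across allocations.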
Tensor lifetime management
Intra-stage (within a stage)
Each activation tensor has a reference count equal to the number of ops that consume it. After each op executes, the runtime decrements the ref count of each input tensor. When the count reaches zero, the tensor is freed (eligible for reclamation on the next compact).
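The per-op bookkeeping reduces to a decrement and a flag. A sketch with illustrative field names:

```c
typedef struct {
    int refcount;  /* number of not-yet-executed consumer ops */
    int freed;     /* eligible for reclamation by the next compact */
} tg_tensor_t;

/* Called once per input tensor after an op finishes executing. */
static void tg_release_input(tg_tensor_t *t) {
    if (--t->refcount == 0)
        t->freed = 1;  /* space is reclaimed lazily, at the next compaction */
}
```

Because the consumer count is known at compile time from the op graph, no dynamic tracking is needed: the plan simply records each tensor's initial refcount.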
Inter-stage (between stages)
- Stage outputs: After a stage completes, tensors consumed by later stages are spilled (copied) to the slow buffer.
- Stage inputs: Before a stage executes, tensors produced by earlier stages are loaded from the slow buffer into the fast arena.
- The fast arena is reset between stages, so all intra-stage temporaries are reclaimed automatically.
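Putting the inter-stage rules together, the executor's outer loop can be simulated in miniature. Everything here is hypothetical (`run_stage`, the buffers, the +1 "compute" standing in for real kernels); spill and load are plain copies between the two regions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { FAST_SIZE = 64, SLOW_SIZE = 256 };

static uint8_t fast_arena[FAST_SIZE]; /* SRAM working memory */
static uint8_t slow_buf[SLOW_SIZE];   /* PSRAM inter-stage storage */

/* One stage: load input from PSRAM, compute in SRAM, spill the output.
 * Returns the new slow-buffer bump offset. */
static size_t run_stage(size_t in_off, size_t in_len, size_t slow_bump) {
    uint8_t *in = fast_arena;                   /* load: slow -> fast */
    memcpy(in, slow_buf + in_off, in_len);

    uint8_t *out = fast_arena + in_len;         /* compute in fast arena */
    for (size_t i = 0; i < in_len; i++)
        out[i] = (uint8_t)(in[i] + 1);          /* stand-in for real ops */

    memcpy(slow_buf + slow_bump, out, in_len);  /* spill: fast -> slow */
    /* fast arena is reset here; intra-stage temporaries vanish */
    return slow_bump + in_len;
}
```

Chaining two stages shows the data path: the caller places the model input in the slow buffer, each stage reads its predecessor's spilled output, and the final result is read from the slow buffer.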
Special tensors
- Model inputs: Allocated in the slow buffer by the caller before invoking the executor. The first stage loads them.
- Model outputs: Left in the slow buffer after the last stage. The caller reads them directly.
- Constants/weights: Reside in flash. Not allocated in either arena. Accessed by pointer.
Alignment
Tensor addresses are aligned to TIGRIS_TENSOR_ALIGN bytes. The value is auto-detected at compile time based on the target architecture.
| Architecture | TIGRIS_TENSOR_ALIGN | Notes |
|---|---|---|
| Xtensa (ESP32-S3) | 8 bytes | Matches ee.vld.l.64.ip load alignment |
| AArch64 (Cortex-A, RPi) | 16 bytes | NEON 128-bit vector alignment |
| x86_64 | 32 bytes | AVX2 256-bit vector alignment |
| ARM32 (Cortex-M) | 4 bytes | Minimum word alignment |
Override with -DTIGRIS_TENSOR_ALIGN=N at build time. Backend-specific scratch buffers (e.g., ESP-NN SIMD) may require stricter alignment and are handled separately by the backend adapter.
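Auto-detection of this kind is typically done with predefined compiler macros. The following is a sketch of the pattern, not the actual TiGrIS header; the macro tests are common conventions (`__AVX2__`, `__aarch64__`, `__XTENSA__`) rather than confirmed details of the build.

```c
/* Pick a per-architecture default at build time; a user-supplied
 * -DTIGRIS_TENSOR_ALIGN=N wins because of the #ifndef guard. */
#ifndef TIGRIS_TENSOR_ALIGN
#  if defined(__AVX2__)
#    define TIGRIS_TENSOR_ALIGN 32  /* x86_64: AVX2 256-bit vectors */
#  elif defined(__aarch64__)
#    define TIGRIS_TENSOR_ALIGN 16  /* AArch64: NEON 128-bit vectors */
#  elif defined(__XTENSA__) || defined(__xtensa__)
#    define TIGRIS_TENSOR_ALIGN 8   /* Xtensa: ee.vld.l.64.ip loads */
#  else
#    define TIGRIS_TENSOR_ALIGN 4   /* ARM32 and others: word alignment */
#  endif
#endif

/* Mask-based rounding in the allocator requires a power of two. */
_Static_assert((TIGRIS_TENSOR_ALIGN & (TIGRIS_TENSOR_ALIGN - 1)) == 0,
               "TIGRIS_TENSOR_ALIGN must be a power of two");
```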
Memory budget sizing
The -m flag sets the SRAM budget available for inference. Choosing the right budget involves a tradeoff:
| Budget | Stages | Spill traffic | Tiling overhead | Inference speed |
|---|---|---|---|---|
| Large | Fewer | Less | Less | Faster |
| Small | More | More | More | Slower, but frees SRAM |
A larger budget means fewer temporal partitions, less data movement between SRAM and PSRAM, and fewer tiled iterations. A smaller budget frees SRAM for other uses like DMA buffers, RTOS stacks, or backend scratch buffers.
Backend scratch buffers
Some kernel backends (ESP-NN, CMSIS-NN) need scratch memory for internal computation. This scratch is carved from the top of the fast arena by a prepare() call before inference starts. The effective arena size for activations is reduced accordingly.
Arena with backend scratch:

```
Low addr                                         High addr
┌───────────┬──────────────────┬──────────────────┐
│ reserved  │ activation space │ backend scratch  │
│ (weights) │ (bump allocator) │  (SIMD scratch)  │
└───────────┴──────────────────┴──────────────────┘
```

The sweet spot depends on the model and backend. Example from a large int8 model on ESP32-S3:
| Configuration | Arena | Scratch | Inference latency |
|---|---|---|---|
| Minimal scratch | 232 KB | 23 KB | ~40 s |
| Balanced | 64 KB | 180 KB | ~11 s |
The balanced configuration is roughly 4x faster because ESP-NN’s SIMD kernels use the scratch space effectively, and the additional tiling overhead from a smaller arena is more than offset by faster kernel execution.
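The prepare()-time carving can be sketched as shrinking the size visible to the bump allocator, so activations and scratch can never collide. Names here are hypothetical, not the actual TiGrIS API.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;
    size_t   size;  /* effective size seen by the bump allocator */
    size_t   bump;
} tg_arena_t;

/* Reserve scratch_len bytes at the top of the arena for the backend.
 * Returns the scratch pointer, or NULL if it would collide with
 * activations already allocated. */
static void *tg_carve_scratch(tg_arena_t *a, size_t scratch_len) {
    if (scratch_len > a->size - a->bump) return NULL;
    a->size -= scratch_len;   /* activation space ends below the scratch */
    return a->base + a->size; /* scratch occupies [size, old size) */
}
```

After carving, `tg_arena_alloc`-style bump allocation proceeds unchanged against the reduced `size`, which is exactly the "effective arena size for activations is reduced accordingly" behavior described above.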
Sizing guideline
- Run `tigris analyze model.onnx -m <budget>` with different budgets to see the stage count and tiling plan.
- Reserve enough SRAM for the backend’s scratch requirements.
- Start with the smallest budget that keeps the stage count reasonable (under ~20 stages for most models), then increase if latency is too high.