Core Concepts

The key ideas behind how TiGrIS compiles and executes models on memory-constrained devices.

Execution plan (.tgrs)

A .tgrs file is a binary artifact containing the operator schedule, memory map, tiling strategy, and weights. It is not code, but a data structure that the C runtime interprets. One plan can run on any target that has a compatible kernel backend.

ds_cnn.tgrs (26 KB)
├── header        magic, version, memory requirements
├── memory map    tensor offsets within the SRAM arena
├── stages[]      ordered list of execution stages
│   └── ops[]     operator descriptors (type, params, tensor refs)
└── weights       model weights (f32 or int8), read via XIP or copied to SRAM

The plan is deterministic. Every memory address is resolved at compile time. There is no dynamic allocation at runtime.
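
As a sketch, the header fields shown in the tree above could map to a C struct like the following. All names and field widths here are illustrative, not the actual .tgrs format:

```c
#include <stdint.h>

/* Hypothetical sketch of a .tgrs header; field names and widths are
 * illustrative, not the real on-disk format. */
typedef struct {
    uint32_t magic;           /* file identification */
    uint16_t version;
    uint32_t sram_required;   /* peak SRAM arena size in bytes */
    uint32_t psram_required;  /* spill region size; 0 for single-stage plans */
    uint32_t stage_count;
    uint32_t weights_offset;  /* byte offset of the weight blob in the file */
} tgrs_header;

/* Reject a plan whose SRAM requirement exceeds the device budget. */
static int plan_fits(const tgrs_header *h, uint32_t sram_budget) {
    return h->sram_required <= sram_budget;
}
```

Because the memory requirements are recorded up front, a loader can reject an oversized plan before touching any operator data.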

Memory pools

TiGrIS manages three distinct memory regions:

Pool    Speed      Typical size   Used for
SRAM    Fast       Kilobytes      Activation tensors, scratch buffers
PSRAM   Slow       Megabytes      Spill/reload between stages
Flash   Read-only  Megabytes      Weights (XIP), plan metadata

The -m flag sets the SRAM budget; the compiler guarantees the plan never exceeds it. Models that fit in a single stage run with SRAM only. Multi-stage models also require PSRAM, because intermediate tensors spill there between stages; pass a second -m value to set the PSRAM budget.
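
Since every tensor offset is resolved at compile time, the runtime's SRAM pool can be a single static arena. A minimal sketch, with `SRAM_BUDGET` and `tensor_ptr` as illustrative names rather than the real API:

```c
#include <stdint.h>

/* Sketch: the SRAM pool as one static arena. Tensor addresses are just
 * base + compile-time offset from the plan's memory map; no malloc. */
#define SRAM_BUDGET (64 * 1024)  /* e.g. an -m budget of 64 KB (hypothetical) */

static uint8_t sram_arena[SRAM_BUDGET];

static void *tensor_ptr(uint32_t offset) {
    return &sram_arena[offset];
}
```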

Stages

The compiler splits the model graph into stages: groups of operators whose combined activation memory fits within the SRAM budget. Within a stage, all tensors are arena-allocated in SRAM. Between stages, intermediate tensors that are still needed later spill to PSRAM and reload when consumed.

A small model whose peak activation fits within the SRAM budget compiles into a single stage with no spilling. A larger model whose activations exceed the budget compiles into multiple stages with spill/reload between them.
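
The stage-by-stage execution can be sketched as the loop below. The function names (`run_stage`, `spill_to_psram`, `reload_from_psram`) are hypothetical stand-ins for what the runtime does with the plan's stage descriptors:

```c
#include <stdint.h>

/* Minimal sketch of multi-stage execution with spill/reload between stages. */
static void reload_from_psram(uint32_t s) { (void)s; /* copy spilled inputs back into SRAM */ }
static void run_stage(uint32_t s)         { (void)s; /* execute the stage's ops[] in the SRAM arena */ }
static void spill_to_psram(uint32_t s)    { (void)s; /* save outputs later stages still need */ }

static uint32_t run_plan(uint32_t stage_count) {
    for (uint32_t s = 0; s < stage_count; s++) {
        reload_from_psram(s);
        run_stage(s);
        spill_to_psram(s);
    }
    return stage_count;  /* stages executed */
}
```

For a single-stage plan the loop body runs once and the spill/reload steps have nothing to do, so the model executes entirely in SRAM.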

Tiling

The compiler uses three tiling strategies to fit models into limited SRAM:

Temporal partitioning. The compiler splits the model graph into stages whose combined activations fit in SRAM. Between stages, intermediate tensors spill to PSRAM and reload when needed. This is the coarsest strategy and applies automatically when the full model does not fit.

Spatial tiling. When a single stage’s peak activation still exceeds the SRAM budget, the compiler tiles along the height dimension, processing horizontal strips instead of the full output tensor. Each strip only needs its receptive-field input rows plus halo overlap.
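
The strip sizing follows standard receptive-field arithmetic. A sketch, assuming no padding (the function names are illustrative):

```c
/* Input rows needed to produce one horizontal output strip of a convolution
 * (no padding assumed). */
static int input_rows_for_strip(int out_rows, int kernel_h, int stride_h) {
    return (out_rows - 1) * stride_h + kernel_h;
}

/* Halo: input rows shared between adjacent strips. */
static int halo_rows(int kernel_h, int stride_h) {
    return kernel_h > stride_h ? kernel_h - stride_h : 0;
}
```

For a 3x3 stride-1 convolution, an 8-row output strip reads 10 input rows, 2 of which overlap with the neighboring strip.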

Chain tiling. When consecutive stages are all spatially tileable, the compiler fuses them into a chain. Tiles flow through the entire chain in SRAM, so intermediate tensors are never fully materialized and never hit PSRAM. This is the lowest-overhead strategy but requires every operator in the chain to support spatial decomposition.
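
For a fused chain, the rows one tile needs can be found by walking the per-op strip formula backwards from the final output. A sketch under the same no-padding assumption, with illustrative names:

```c
/* Input rows one tile needs end-to-end through a chain of spatially
 * tileable ops, computed by composing the strip formula in reverse. */
static int chain_input_rows(int out_rows, const int *kernel_h,
                            const int *stride_h, int n_ops) {
    int rows = out_rows;
    for (int i = n_ops - 1; i >= 0; i--)
        rows = (rows - 1) * stride_h[i] + kernel_h[i];
    return rows;
}
```

For two 3x3 stride-1 convolutions, a 4-row output tile traces back through a 6-row intermediate to 8 input rows; the intermediate strip lives only in SRAM.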

For implementation details and formulas, see Tiling.

XIP (Execute in Place)

With --xip, weights stay in flash and are read via memory-mapped I/O at inference time. They are never copied to SRAM.

This is critical for larger models whose weights alone would exceed the SRAM budget. With XIP, SRAM is reserved exclusively for activations and scratch buffers.

The tradeoff is latency: flash reads are slower than SRAM reads. On ESP32-S3 with memory-mapped flash, the overhead is small (~5%) because the cache handles sequential weight access patterns well.
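
The XIP read path amounts to addressing weights through the memory-mapped flash region rather than copying them. A minimal sketch (`weight_ptr` and the offset scheme are illustrative):

```c
#include <stdint.h>

/* Sketch of the XIP read path: kernels dereference a pointer into
 * memory-mapped flash directly; no SRAM copy is ever made. */
static const uint8_t *weight_ptr(const uint8_t *flash_base, uint32_t offset) {
    return flash_base + offset;  /* cache-backed read */
}
```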

Kernel backends

The execution plan specifies what to compute (operator type, tensor shapes, quantization parameters), but not how. The kernel backend provides the actual operator implementations.

Backend    Target         Notes
reference  Any            Portable C99. Correct but slow. Handles both float32 and int8 models.
cmsis-nn   Arm Cortex-M   Arm's optimized kernels for the Cortex-M family (DSP on M4/M7, Helium on M55, plain C fallback elsewhere).
esp-nn     ESP32 family   Xtensa SIMD intrinsics, ~20x faster than reference.

The same .tgrs plan works with any backend. Swapping reference for esp-nn on ESP32-S3 can yield 10-20x speedups with the same plan and identical results.
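
One way to picture the what/how split: the plan's op descriptors say what to run, and a backend is a table of function pointers saying how. A sketch with hypothetical struct and kernel names (real signatures would take the op descriptor and arena):

```c
/* Sketch of backend dispatch via a kernel table. */
typedef struct {
    void (*conv2d)(void);
    void (*fully_connected)(void);
} tgrs_kernels;

static int ref_calls;
static void ref_conv2d(void)          { ref_calls++; /* portable C99 loop nest */ }
static void ref_fully_connected(void) { ref_calls++; }

/* The reference backend; esp-nn or cmsis-nn would fill the same table with
 * optimized implementations behind identical signatures. */
static const tgrs_kernels reference_backend = { ref_conv2d, ref_fully_connected };
```

Swapping backends then means swapping the table, which is why the same plan runs unchanged on any of them.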

Quantization

TiGrIS handles both float32 and int8 quantized models. For int8, the compiler folds ONNX QDQ (QuantizeLinear / DequantizeLinear) nodes at compile time, precomputing the fixed-point multipliers and right-shift values each operator needs. Both symmetric and asymmetric quantization are supported. Quantize your model with any standard tool and TiGrIS handles the rest.

Int8 is recommended for embedded deployment: it cuts memory by 4x and unlocks SIMD kernel backends (ESP-NN, CMSIS-NN) that only support integer arithmetic.