# Quickstart
End-to-end walkthrough: take an ONNX model that does not fit in your target’s SRAM, tile it with TiGrIS, and deploy it.
We’ll use MobileNetV1 (int8 quantized, 128x128 input, 3.2M parameters). Its naive peak activation memory is 256 KiB, but we’ll compile it for a 64 KB SRAM budget on an ESP32-S3. TiGrIS tiles it into 9 stages and runs it in ~1.4 seconds. For full benchmark results, see Introducing TiGrIS.
## Prerequisites

- Python 3.10+ with `tigris-ml` installed
- An ONNX model (f32, int8, or any quantization)

```shell
pip install tigris-ml
```

Any ONNX model works. This walkthrough uses MobileNetV1 (128x128). To generate it and the other benchmark models, run `python models/prepare.py` from tigris-bench.
## Step 1: Analyze
Check whether the model fits within a 64 KB SRAM + 8 MB PSRAM budget (typical for an ESP32-S3):
```shell
tigris analyze mobilenet_v1_i8.onnx -m 64K -m 8M -f 16M
```

```text
╭────────────────────── TiGrIS - mobilenet_v1_i8 ──────────────────────╮
│ Operators            30                                              │
│ Tensors              114 (31 activations)                            │
│ Peak memory (naive)  256.00 KiB                                      │
│ Largest tensor       1x64x64x64 (256.00 KiB)                         │
│ Quantization         INT8 (QDQ)                                      │
╰──────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────── SRAM ────────────────────────────────╮
│ Budget               64.00 KiB                                       │
│ pool 2 (slow)        8.00 MiB                                        │
│ Scheduled peak       64.00 KiB (25.0% of naive peak)                 │
│ Stages               9                                               │
│ Spill / reload I/O   992.01 KiB / 1.02 MiB                           │
│                                                                      │
│ Need tiling          6 of 9 stages                                   │
│ tileable             6 (18 tiles, max halo 2)                        │
╰──────────────── PASS - tiling resolves all stages ───────────────────╯
╭─────────────────────────────── Flash ────────────────────────────────╮
│ Budget               16.00 MiB                                       │
│ Weight data          3.09 MiB                                        │
│ Plan overhead        0.01 MiB                                        │
│ Plan (est.)          3.10 MiB                                        │
╰────────────────────────── PASS - plan fits ──────────────────────────╯
```

The model's naive peak is 256 KiB but the SRAM budget is 64 KiB. The compiler partitions the graph into 9 stages, 6 of which need spatial tiling (18 tiles total). Intermediate results spill to the 8 MB PSRAM pool between stages. No hardware is required to run `analyze`.
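As a sanity check on those numbers: the 256 KiB naive peak is exactly the size of the largest activation (`1x64x64x64`, one byte per int8 element). A few lines of plain Python illustrate the arithmetic (`tensor_bytes` is a standalone helper for this walkthrough, not part of the tigris-ml API):

```python
def tensor_bytes(shape, bytes_per_elem=1):
    """Activation tensor size; int8 tensors use one byte per element."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem

# Largest tensor from the analyze report: 1x64x64x64, INT8
print(tensor_bytes((1, 64, 64, 64)) // 1024)   # 256 -> the 256.00 KiB naive peak
# The 128x128x3 model input
print(tensor_bytes((1, 3, 128, 128)) // 1024)  # 48 -> 48 KiB, fits the budget as-is
```

This is why tiling is unavoidable here: a single intermediate tensor is four times the entire SRAM budget.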
## Step 2: Compile
Generate a binary execution plan:
```shell
tigris compile mobilenet_v1_i8.onnx -m 64K -m 8M -f 16M --xip -o mobilenet.tgrs
```

```text
Binary plan written to mobilenet.tgrs
  30 ops, 9 stages @ 64.00 KiB budget
  plan size: 3.10 MiB
  flash 16.00 MiB: fits
```

| Flag | Meaning |
|---|---|
| `-m 64K -m 8M` | Memory pools, ordered fast to slow: the first is the SRAM budget, the second is PSRAM. The compiler decides what goes where. |
| `-f 16M` | Flash budget. Warns if the plan doesn't fit. |
| `--xip` | Execute-in-place. Weights are read from flash at runtime, not copied to SRAM. |
| `-o mobilenet.tgrs` | Output path for the binary plan. |
PSRAM is required for multi-stage models. Without it, only single-stage models (where the full model fits in one SRAM arena) are supported.
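A note on the size suffixes: the `analyze` report echoes `-m 64K` as 64.00 KiB and `-m 8M` as 8.00 MiB, i.e. binary units. A hypothetical parser for that convention (a sketch to make the flag semantics concrete, not the CLI's actual implementation):

```python
def parse_size(text):
    """Parse a size flag like '64K' or '8M' into bytes.
    Assumes binary units, matching the KiB/MiB figures tigris prints."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)  # bare byte count

print(parse_size("64K"), parse_size("8M"))  # 65536 8388608
```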
## Step 3: Generate C code

Use `codegen` to produce a backend-specific C harness:

```shell
tigris codegen mobilenet.tgrs --backend esp-nn -o mobilenet.c
```

| Backend | Target |
|---|---|
| `reference` | Portable C99 (any platform) |
| `esp-nn` | Espressif optimized kernels (ESP32 family) |
| `cmsis-nn` | Arm optimized kernels (Cortex-M family) |

Same plan, different kernels. Alternatively, skip codegen and load the `.tgrs` file at runtime from a flash partition. See Runtime Integration for both approaches.
## Step 4: Simulate (optional)
Inspect the execution trace before deploying:
```shell
tigris simulate mobilenet_v1_i8.onnx -m 64K -m 8M
```

This prints a step-by-step trace of what the runtime would do. No actual inference runs.

```text
╭───────────────── TiGrIS Simulate - mobilenet_v1_i8 ──────────────────╮
│ 30 ops, 9 stages, 64.00 KiB budget, 256.00 KiB peak                  │
╰──────────────────────────────────────────────────────────────────────╯

────────────────────────── Stage 0 (ops 0-0) ─────────────────────────
Peak: 48.00 KiB | Fits budget
Reload inputs:
  input      [1, 3, 128, 128]    48.00 KiB  <- slow memory

Step  Op          Type  In shape          Out shape        Live
   0  conv0_conv  Conv  [1, 3, 128, 128]  [1, 32, 64, 64]  48 KiB

Spill outputs:
  conv0_out  [1, 32, 64, 64]   128.00 KiB  -> slow memory

────────────────────────── Stage 1 (ops 1-1) ─────────────────────────
Peak: 128.00 KiB | Tiled: 3 tiles, 30 rows + 2 halo (RF 3)
Reload inputs:
  conv0_out  [1, 32, 64, 64]   128.00 KiB  <- slow memory

Step  Op          Type           In shape         Out shape
   1  b1_dw_conv  DepthwiseConv  [1, 32, 64, 64]  [1, 32, 64, 64]

Spill outputs:
  b1_dw_out  [1, 32, 64, 64]   128.00 KiB  -> slow memory

... (7 more stages)
```

Each stage shows what gets reloaded from slow memory, which ops run, and what gets spilled back. Stage 1's peak (128 KiB) exceeds the 64 KiB budget, so the compiler tiles it into 3 passes of 30 rows with a 2-row halo overlap.
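The row tiling applied to Stage 1 can be sketched in a few lines of Python. This illustrates the general halo-tiling technique only (`plan_row_tiles` is a hypothetical helper, not TiGrIS internals): a 3x3 kernel (receptive field 3) needs a 1-row halo per side, so each tile of output rows reads one extra input row past each interior edge.

```python
def plan_row_tiles(total_rows, rows_per_tile, halo):
    """Split output rows into tiles; each tile reads `halo` extra
    input rows past each interior edge (receptive-field overlap)."""
    tiles = []
    out_start = 0
    while out_start < total_rows:
        out_end = min(out_start + rows_per_tile, total_rows)
        in_start = max(0, out_start - halo)          # halo above, clipped at edge
        in_end = min(total_rows, out_end + halo)     # halo below, clipped at edge
        tiles.append((out_start, out_end, in_start, in_end))
        out_start = out_end
    return tiles

# 64-row feature map, 30 output rows per tile, 1-row halo per edge (RF 3)
for tile in plan_row_tiles(64, 30, 1):
    print(tile)  # (out_start, out_end, in_start, in_end)
```

An interior tile reads 30 + 2 input rows, matching the `30 rows + 2 halo (RF 3)` annotation in the trace, and the 64 rows split into 3 tiles.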
## Step 5: Deploy
The `.tgrs` plan contains the operator schedule, memory map, tiling parameters, and weights. On your target:

1. Flash the plan to a partition, or embed it via `codegen`.
2. Load it with `tigris_plan_load()`.
3. Initialize arenas with `tigris_mem_init()`.
4. Run inference with `tigris_run()`.
Minimal C example:
```c
#include "tigris.h"
#include "tigris_loader.h"
#include "tigris_mem.h"
#include "tigris_executor.h"
#include "tigris_kernels_s8.h"

#include <string.h>

extern const uint8_t plan_data[];
extern const uint32_t plan_size;

static uint8_t fast_buf[64 * 1024];   /* 64K SRAM arena */
static uint8_t slow_buf[512 * 1024];  /* PSRAM for spills */

void run_inference(const int8_t *input, size_t input_size) {
    tigris_plan_t plan;
    tigris_plan_load(plan_data, plan_size, &plan);

    void *tensor_ptrs[128]; /* max tensors */
    tigris_mem_t mem;
    tigris_mem_init(&mem, (void **)tensor_ptrs, plan.header->num_tensors,
                    fast_buf, sizeof(fast_buf),
                    slow_buf, sizeof(slow_buf));

    /* Stage the input tensor in slow memory. */
    uint16_t in_idx = plan.model_inputs[0];
    tigris_mem_alloc_slow(&mem, in_idx, plan.tensors[in_idx].size_bytes);
    memcpy(mem.tensor_ptrs[in_idx], input, input_size);

    tigris_exec_stats_t stats;
    tigris_run(&plan, &mem, tigris_dispatch_kernel_s8, NULL, &stats);

    uint16_t out_idx = plan.model_outputs[0];
    int8_t *output = (int8_t *)mem.tensor_ptrs[out_idx];
    (void)output; /* consume the int8 result here */
}
```

The runtime is about 8 KB of code, requires no heap allocation, and works on any C99 target. See Runtime Integration for error handling, ESP-IDF setup, and ESP-NN backend configuration.
## What's next
- Core Concepts: tiling strategies, memory pools, execute-in-place
- CLI Reference: all commands and flags
- Runtime Integration: full C API and firmware integration