# Quickstart
End-to-end walkthrough: take an ONNX model that does not fit in your target’s SRAM, tile it with TiGrIS, and deploy it.
We’ll use MobileNetV1 (int8 quantized, 128x128 input, 3.2M parameters). Its naive peak activation memory is 256 KiB, but we’ll compile it for a 64 KB SRAM budget on an ESP32-S3. TiGrIS tiles it into 9 stages and runs it in ~1.4 seconds. For full benchmark results, see Introducing TiGrIS.
## Prerequisites

- Python 3.10+ with `tigris-ml` installed
- An ONNX model (f32, int8, or any quantization)

```shell
pip install tigris-ml
```

Any ONNX model works. This walkthrough uses MobileNetV1 (128x128). To generate it and the other benchmark models, run `python models/prepare.py` from tigris-bench.
## Step 1: Analyze
Check whether the model fits within a 64 KB SRAM + 8 MB PSRAM budget (typical for an ESP32-S3):
```shell
tigris analyze mobilenet_v1_i8.onnx -m 64K -m 8M -f 16M
```

```text
╭────────────────────── TiGrIS - mobilenet_v1_i8 ──────────────────────╮
│ Operators            30                                              │
│ Tensors              114 (31 activations)                            │
│ Peak memory (naive)  256.00 KiB                                      │
│ Largest tensor       1x64x64x64 (256.00 KiB)                         │
│ Quantization         INT8 (QDQ)                                      │
╰──────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────── SRAM ────────────────────────────────╮
│ Budget               64.00 KiB                                       │
│ pool 2 (slow)        8.00 MiB                                        │
│ Scheduled peak       64.00 KiB (25.0% of naive peak)                 │
│ Stages               9                                               │
│ Spill / reload I/O   992.01 KiB / 1.02 MiB                           │
│                                                                      │
│ Need tiling          6 of 9 stages                                   │
│ tileable             6 (18 tiles, max halo 2)                        │
╰──────────────── PASS - tiling resolves all stages ───────────────────╯
╭─────────────────────────────── Flash ────────────────────────────────╮
│ Budget               16.00 MiB                                       │
│ Weight data          3.09 MiB                                        │
│ Plan overhead        0.01 MiB                                        │
│ Plan (est.)          3.10 MiB                                        │
╰────────────────────────── PASS - plan fits ──────────────────────────╯
```

The model's naive peak is 256 KiB but the SRAM budget is 64 KiB. The compiler partitions the graph into 9 stages, 6 of which need spatial tiling (18 tiles total). Intermediate results spill to the 8 MB PSRAM pool between stages. No hardware is required to run `analyze`.
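As a sanity check on those numbers: the 256 KiB naive peak is exactly the size of the largest activation (`1x64x64x64`, one byte per int8 element). A few lines of plain Python illustrate the arithmetic (`tensor_bytes` is a standalone helper for this walkthrough, not part of the tigris-ml API):

```python
def tensor_bytes(shape, bytes_per_elem=1):
    """Activation tensor size; int8 tensors use one byte per element."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem

# Largest tensor from the analyze report: 1x64x64x64, INT8
print(tensor_bytes((1, 64, 64, 64)) // 1024)   # 256 -> the 256.00 KiB naive peak
# The 128x128x3 model input
print(tensor_bytes((1, 3, 128, 128)) // 1024)  # 48 -> 48 KiB, fits the budget as-is
```

This is why tiling is unavoidable here: a single intermediate tensor is four times the entire SRAM budget.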
## Step 2: Compile
Generate a binary execution plan:
```shell
tigris compile mobilenet_v1_i8.onnx -m 64K -m 8M -f 16M --xip -o mobilenet.tgrs
```

```text
Binary plan written to mobilenet.tgrs
  30 ops, 9 stages @ 64.00 KiB budget
  plan size: 3.10 MiB
  flash 16.00 MiB: fits
```

| Flag | Meaning |
|---|---|
| `-m 64K -m 8M` | Memory pools, ordered fast to slow: the first is the SRAM budget, the second is PSRAM. The compiler decides what goes where. |
| `-f 16M` | Flash budget. Warns if the plan doesn't fit. |
| `--xip` | Execute-in-place. Weights are read from flash at runtime, not copied to SRAM. |
| `-o mobilenet.tgrs` | Output path for the binary plan. |
PSRAM is required for multi-stage models. Without it, only single-stage models (where the full model fits in one SRAM arena) are supported.
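A note on the size suffixes: the `analyze` report echoes `-m 64K` as 64.00 KiB and `-m 8M` as 8.00 MiB, i.e. binary units. A hypothetical parser for that convention (a sketch to make the flag semantics concrete, not the CLI's actual implementation):

```python
def parse_size(text):
    """Parse a size flag like '64K' or '8M' into bytes.
    Assumes binary units, matching the KiB/MiB figures tigris prints."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)  # bare byte count

print(parse_size("64K"), parse_size("8M"))  # 65536 8388608
```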
## Step 3: Generate C code

Use `codegen` to produce a backend-specific C harness:

```shell
tigris codegen mobilenet.tgrs --backend esp-nn -o mobilenet.c
```

| Backend | Target |
|---|---|
| `reference` | Portable C99 (any platform) |
| `esp-nn` | Espressif optimized kernels (ESP32 family) |
| `cmsis-nn` | Arm optimized kernels (Cortex-M family) |

Same plan, different kernels. Alternatively, skip codegen and load the `.tgrs` file at runtime from a flash partition. See Runtime Integration for both approaches.
## Step 4: Simulate (optional)
Inspect the execution trace before deploying:
```shell
tigris simulate mobilenet_v1_i8.onnx -m 64K -m 8M
```

This prints a step-by-step trace of what the runtime would do. No actual inference runs.

```text
╭───────────────── TiGrIS Simulate - mobilenet_v1_i8 ──────────────────╮
│ 30 ops, 9 stages, 64.00 KiB budget, 256.00 KiB peak                  │
╰──────────────────────────────────────────────────────────────────────╯

────────────────────────── Stage 0 (ops 0-0) ─────────────────────────
Peak: 48.00 KiB | Fits budget
Reload inputs:
  input      [1, 3, 128, 128]    48.00 KiB  <- slow memory

Step  Op          Type  In shape          Out shape        Live
   0  conv0_conv  Conv  [1, 3, 128, 128]  [1, 32, 64, 64]  48 KiB

Spill outputs:
  conv0_out  [1, 32, 64, 64]   128.00 KiB  -> slow memory

────────────────────────── Stage 1 (ops 1-1) ─────────────────────────
Peak: 128.00 KiB | Tiled: 3 tiles, 30 rows + 2 halo (RF 3)
Reload inputs:
  conv0_out  [1, 32, 64, 64]   128.00 KiB  <- slow memory

Step  Op          Type           In shape         Out shape
   1  b1_dw_conv  DepthwiseConv  [1, 32, 64, 64]  [1, 32, 64, 64]

Spill outputs:
  b1_dw_out  [1, 32, 64, 64]   128.00 KiB  -> slow memory

... (7 more stages)
```

Each stage shows what gets reloaded from slow memory, which ops run, and what gets spilled back. Stage 1's peak (128 KiB) exceeds the 64 KiB budget, so the compiler tiles it into 3 passes of 30 rows with a 2-row halo overlap.
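The row tiling applied to Stage 1 can be sketched in a few lines of Python. This illustrates the general halo-tiling technique only (`plan_row_tiles` is a hypothetical helper, not TiGrIS internals): a 3x3 kernel (receptive field 3) needs a 1-row halo per side, so each tile of output rows reads one extra input row past each interior edge.

```python
def plan_row_tiles(total_rows, rows_per_tile, halo):
    """Split output rows into tiles; each tile reads `halo` extra
    input rows past each interior edge (receptive-field overlap)."""
    tiles = []
    out_start = 0
    while out_start < total_rows:
        out_end = min(out_start + rows_per_tile, total_rows)
        in_start = max(0, out_start - halo)          # halo above, clipped at edge
        in_end = min(total_rows, out_end + halo)     # halo below, clipped at edge
        tiles.append((out_start, out_end, in_start, in_end))
        out_start = out_end
    return tiles

# 64-row feature map, 30 output rows per tile, 1-row halo per edge (RF 3)
for tile in plan_row_tiles(64, 30, 1):
    print(tile)  # (out_start, out_end, in_start, in_end)
```

An interior tile reads 30 + 2 input rows, matching the `30 rows + 2 halo (RF 3)` annotation in the trace, and the 64 rows split into 3 tiles.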
## Step 5: Deploy
The `.tgrs` plan contains the operator schedule, memory map, tiling parameters, and weights. On your target:

1. Flash the plan to a partition, or embed it via `codegen`.
2. Load it with `tigris_plan_load()`.
3. Initialize arenas with `tigris_mem_init()`.
4. Run inference with `tigris_run()`.
Minimal C example:
```c
#include "tigris.h"
#include "tigris_loader.h"
#include "tigris_mem.h"
#include "tigris_executor.h"
#include "tigris_kernels_s8.h"

#include <string.h>

extern const uint8_t plan_data[];
extern const uint32_t plan_size;

static uint8_t fast_buf[64 * 1024];   /* 64K SRAM arena */
static uint8_t slow_buf[512 * 1024];  /* PSRAM for spills */

void run_inference(const int8_t *input, size_t input_size) {
    tigris_plan_t plan;
    tigris_plan_load(plan_data, plan_size, &plan);

    void *tensor_ptrs[128]; /* max tensors */
    tigris_mem_t mem;
    tigris_mem_init(&mem, (void **)tensor_ptrs, plan.header->num_tensors,
                    fast_buf, sizeof(fast_buf),
                    slow_buf, sizeof(slow_buf));

    /* Stage the input tensor in slow memory. */
    uint16_t in_idx = plan.model_inputs[0];
    tigris_mem_alloc_slow(&mem, in_idx, plan.tensors[in_idx].size_bytes);
    memcpy(mem.tensor_ptrs[in_idx], input, input_size);

    tigris_exec_stats_t stats;
    tigris_run(&plan, &mem, tigris_dispatch_kernel_s8, NULL, &stats);

    uint16_t out_idx = plan.model_outputs[0];
    int8_t *output = (int8_t *)mem.tensor_ptrs[out_idx];
    (void)output; /* consume the int8 result here */
}
```

The runtime is about 8 KB of code, requires no heap allocation, and works on any C99 target. See Runtime Integration for error handling, ESP-IDF setup, and ESP-NN backend configuration.
## What's next
- Core Concepts: tiling strategies, memory pools, execute-in-place
- CLI Reference: all commands and flags
- Runtime Integration: full C API and firmware integration