GPU Inference Platform

Inference Abstraction

Focus on your business logic. The platform handles GPU resource planning, automatic model profiling, dynamic scaling, and intelligent job routing - even with a single GPU.

Automatic GPU Profiling
Dynamic Scaling
Smart & Simple Modes
Priority Job Routing

Integrates with your existing workflows, asset management systems, and storage

01 - The Abstraction

You Write Business Logic. The Network Handles the Rest.

Your application talks to one simple client. Behind the scenes, the inference network manages resources, routes jobs, profiles hardware, and scales automatically.

Diagram - the abstraction boundary:

Your application:
  client = create_client()
  job = submit_job(type, frames)
  connect(job)
  results = infer_batch(frames)
  complete()
4 lines. That's it. Results returned.

Everything past the InferenceClient boundary, the inference network handles:
- Platform services: Consul discovery, MongoDB state, Prometheus metrics, Keycloak auth.
- GPU servers (e.g. 192.168.1.10-12:50051, one or more GPUs each), with each server running: Scheduler (leader-elected), VRAM Manager, Adaptive Scaler, Batch Collector, Auto Profiler.
- Automatic profiling + dynamic scaling + intelligent routing + VRAM management + health monitoring.
4 Lines of Code
Create client, submit job, connect, infer. The entire GPU cluster is abstracted behind one simple interface.
Zero Configuration
No manual GPU sizing, no throughput tuning, no batch size optimization. It's all profiled automatically at startup.
Works with 1 GPU
Even a single GPU benefits from automatic profiling, job scheduling, VRAM management, and model caching.
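The four-call flow above can be sketched end to end. This is a hypothetical stand-in for the real client library (the actual class and method names may differ); the stub methods only mark where the REST and gRPC calls would happen.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_type: str
    frame_count: int
    token_id: str = "stub-token"    # real client: UUID issued by the scheduler

class InferenceClient:
    """Illustrative stand-in for the client library, not the real API."""

    def submit_job(self, job_type: str, frames: list) -> Job:
        # Real client: POST /cluster/jobs/submit; may wait in the priority queue.
        return Job(job_type, len(frames))

    def connect(self, job: Job) -> None:
        # Real client: auto-selects simple/cluster mode, opens gRPC streams
        # with the scheduler-token header, starts the 10 s heartbeat.
        pass

    def infer_batch(self, job: Job, frames: list) -> list:
        # Real client: streams frames over gRPC and yields results.
        return [{"frame": i, "detections": []} for i, _ in enumerate(frames)]

    def complete(self, job: Job) -> None:
        # Real client: POST /cluster/jobs/{id}/complete releases the GPU slots.
        pass

client = InferenceClient()
frames = ["frame-0", "frame-1", "frame-2"]
job = client.submit_job("face_recognition", frames)
client.connect(job)
results = client.infer_batch(job, frames)
client.complete(job)
```

Everything below `submit_job` is the platform's responsibility; the application only ever sees this surface.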
Automatic Model Swapping - More Models Than VRAM

Three registered models need 36 GB total (face_rec 8 GB, obj_detect 12 GB, video_seg 16 GB), but the GPU only has 24 GB. The server swaps automatically as jobs arrive:

- Job A (face_recognition): face_rec loads (8 GB), 14.5 GB free, processing 10k frames.
- Job B (object_detection) arrives: obj_detect (12 GB) fits alongside - both loaded, 2.5 GB free.
- Job C (video_segmentation, 16 GB) arrives: it won't fit, so face_rec and obj_detect are evicted to the RAM cache and video_segmentation loads (6.5 GB free) and processes.
- The RAM cache keeps evicted models warm: reload to VRAM in ~0.3 s versus ~4.2 s cold from disk.
02 - Handshake

Client-Server Handshake

The full sequence from job submission to inference streaming. The client library handles all of this - your app just calls submit, connect, infer.

Sequence (your app code → InferenceClient library → Scheduler → GPU servers):

1. submit_job() → POST /cluster/jobs/submit. The scheduler scores GPUs, claims slots, creates a plan, and generates a token_id; it returns {status, token_id, servers[]}.
2. If status == "queued" (no capacity): the client polls GET /cluster/jobs/{id} every 3 s. When a slot frees, the leader promotes the job and the poll returns {status: "assigned", servers[], token}.
3. connect(job): the client auto-selects a mode and opens a gRPC connection with a scheduler-token: {token_id} header; the server validates the token. In cluster mode, the client connects to every assigned server.
4. A heartbeat loop runs every 10 s for the life of the connection.
5. infer_batch(frames): gRPC InferStream sends frames; the server marks the job active (mark_active(job_id)) and streams results back.
6. complete(): POST /cluster/jobs/{id}/complete releases the slots and sets status=done.

Transports: app code → client library is an in-process function call; client library → scheduler is REST (async httpx) over the network; client library → GPU server is a bidirectional gRPC stream.
Token-Based Auth
Each job gets a unique UUID token. The client passes it in gRPC metadata. The server validates it against MongoDB before processing frames.
Queued Jobs Wait
If no GPU capacity is available, the job enters a priority queue. The client polls every 3 seconds. The scheduler leader promotes jobs as slots free up.
Heartbeat Keep-alive
Once connected, a background heartbeat runs every 10 seconds. If it stops, the scheduler expires the job and releases its GPU slots for other jobs.
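The queued-job wait described above is a simple poll loop. A minimal sketch, with a `fetch_status` callable standing in for the real `GET /jobs/{id}` request (the function and field names here are illustrative):

```python
import time

def wait_for_assignment(fetch_status, poll_interval=3.0, sleep=time.sleep):
    """Poll the scheduler until the job leaves the queue.

    fetch_status: callable returning a dict like {"status": "queued"} or
    {"status": "assigned", "servers": [...], "token_id": "..."}.
    """
    while True:
        state = fetch_status()
        if state["status"] == "assigned":
            return state
        sleep(poll_interval)    # real client waits 3 s between polls

# Simulated scheduler responses: queued twice, then assigned.
responses = iter([
    {"status": "queued"},
    {"status": "queued"},
    {"status": "assigned", "servers": ["192.168.1.10:50051"], "token_id": "abc"},
])
state = wait_for_assignment(lambda: next(responses), sleep=lambda _: None)
```

Once `wait_for_assignment` returns, the client has the server list and token it needs to open the gRPC stream.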
03 - Client Modes

Simple Mode vs Cluster Mode

Simple mode sends everything through one gateway - the server fans out internally. Cluster mode connects directly to all servers and distributes frames by measured throughput.

Simple Mode - "Send to one server. It handles the rest." Your app opens one gRPC connection to a gateway server (e.g. 192.168.1.10:50051); the gateway fans out to peer servers internally.

Cluster Mode - "Connect directly. Distribute by throughput." Your app connects to every server and splits frames by measured throughput - e.g. 120 FPS → 60%, 50 FPS → 25%, 30 FPS → 15% - for maximum throughput.

Auto Mode - selects the best mode automatically: servers > 1 AND role == "direct" ? ClusterMode : SimpleMode.
Simple Mode (Gateway)
1 gRPC connection to the gateway server. The gateway receives all frames and fans out to peers internally. Low complexity, reliable, with automatic retry and reconnection.
Cluster Mode (Direct)
N direct gRPC connections. Client distributes frames weighted by measured throughput (EMA-smoothed). Weights adapt dynamically as server performance changes.
Auto Mode
Detects cluster size and server roles automatically. Picks Simple for single-server plans, Cluster for multi-server with direct roles.
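The auto-mode rule quoted in the diagram is a one-line decision. A sketch of that rule (the function name is illustrative):

```python
def select_mode(servers: list, role: str) -> str:
    """Auto mode: multi-server plan with a "direct" role → cluster mode;
    anything else (single server, or gateway role) → simple mode."""
    return "cluster" if len(servers) > 1 and role == "direct" else "simple"

# Multi-server plan with direct connections → cluster mode.
assert select_mode(["srv-1", "srv-2", "srv-3"], "direct") == "cluster"
# Single-server plan → simple mode, regardless of role.
assert select_mode(["srv-1"], "direct") == "simple"
# Multi-server plan routed through a gateway → simple mode.
assert select_mode(["srv-1", "srv-2"], "gateway") == "simple"
```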
04 - Topologies

From One GPU to Many

Works with any topology. Single node with one GPU, multi-GPU nodes, or a full cluster. The system adapts to whatever hardware you have.

Single node (1 node, 1 GPU): a client in simple mode streams over gRPC to one server (e.g. inference-server-0 at 192.168.1.10:50051 with an RTX 4090 running face_recognition at 95.3 FPS across 10,000 frames). It still gets auto profiling, job scheduling, VRAM management, and model caching.

Two nodes (2 nodes, 3 GPUs, mixed): a cluster-mode client weights Node A (2x RTX 4090, 2 slots, 120+115 FPS multi-GPU parallel) at 65% and Node B (1x RTX 3090, 1 slot, 80 FPS) at 35%.

Full cluster (4 nodes, 9 GPUs, mixed): RTX 4090s, an RTX 3090, and 4x B6000 96 GB (4 slots, 384 GB), discovered via Consul, serving multiple apps in simple, cluster, and auto modes at once. Scales horizontally - add nodes anytime.
05 - Gateway Fan-out

Frames Through the Gateway

In simple mode, your client sends frames to one server. The gateway batches them and processes locally. If scoring indicates it's beneficial, it fans out to peers - otherwise it handles everything itself.

The client buffers frames (encode + resize) and streams them over gRPC with the scheduler-token header. The gateway server (e.g. 192.168.1.10:50051) queues batches (max=16 / 50 ms), processes on its own GPU, and may fan out to peers - e.g. 60% of a batch to one peer, 40% to another. Results stream back to the client; a 10 s heartbeat, retry on failure, and auto reconnect keep the session healthy.

Fan-out to peers only happens when scoring indicates it's beneficial: estimated_frames ≥ fanout_min, each peer adds ≥ 15% speedup, and no jobs are queued. If the job is small or network overhead outweighs the gain, the gateway processes everything locally.
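The fan-out conditions above combine into a single predicate. A sketch of that decision (function and parameter names are illustrative; the 15% threshold and `fanout_min` come from the text):

```python
def should_fan_out(estimated_frames: int, fanout_min: int, queued_jobs: int,
                   peer_speedups: list, min_speedup: float = 0.15) -> bool:
    """Gateway fan-out rule: split a job across peers only when the job is
    large enough, the queue is empty, and every peer adds a real speedup."""
    return (estimated_frames >= fanout_min
            and queued_jobs == 0
            and len(peer_speedups) > 0
            and all(s >= min_speedup for s in peer_speedups))

# Large job, empty queue, both peers add >= 15%: fan out.
assert should_fan_out(10_000, 48, 0, [0.40, 0.25])
# Small job: process locally.
assert not should_fan_out(20, 48, 0, [0.40, 0.25])
# Queued jobs waiting: keep peers free, process locally.
assert not should_fan_out(10_000, 48, 1, [0.40, 0.25])
# A peer that adds under 15%: not worth the network overhead.
assert not should_fan_out(10_000, 48, 0, [0.40, 0.05])
```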
06 - Direct Fan-out

Smart Client Multi-Connect

In cluster mode, the client connects directly to all servers, distributes frames by measured throughput, and adapts weights dynamically using EMA smoothing.

The smart client's _distribute_frames() splits each batch by weight - e.g. Server 1 (2x GPU, 120 FPS): 12 frames (60%), Server 2 (50 FPS): 5 frames (25%), Server 3 (30 FPS): 3 frames (15%). Weights update every batch via EMA: w = 0.3 * observed + 0.7 * prev. All three servers process simultaneously; results are aggregated by original frame index (200 FPS combined in this example), and _refresh_plan() runs on each batch.
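The EMA weight update and proportional split can be sketched directly from the numbers in the diagram. Function names here are illustrative, but the formula (w = 0.3·observed + 0.7·prev) and the 60/25/15 split are from the text:

```python
def update_weights(prev: dict, observed_fps: dict, alpha: float = 0.3) -> dict:
    """EMA-smooth per-server weights from measured throughput, then normalize.
    A server with no history starts at its observed throughput."""
    smoothed = {s: alpha * fps + (1 - alpha) * prev.get(s, fps)
                for s, fps in observed_fps.items()}
    total = sum(smoothed.values())
    return {s: v / total for s, v in smoothed.items()}

def distribute_frames(frames: list, weights: dict) -> dict:
    """Split one batch across servers proportionally to their weights."""
    n = len(frames)
    counts = {s: int(n * w) for s, w in weights.items()}
    # Rounding leftovers go to the fastest server.
    counts[max(weights, key=weights.get)] += n - sum(counts.values())
    out, i = {}, 0
    for s, c in counts.items():
        out[s] = frames[i:i + c]
        i += c
    return out

# 120 / 50 / 30 FPS → 60% / 25% / 15%, so a 20-frame batch splits 12 / 5 / 3.
weights = update_weights({}, {"srv-1": 120, "srv-2": 50, "srv-3": 30})
split = distribute_frames(list(range(20)), weights)
```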
07 - Auto Profiling

GPU Profiling Happens Automatically

When a server starts, it profiles every model: VRAM usage, throughput at each batch size, multi-slot capability, memory hierarchy. No manual configuration needed.

Bootstrap on inference-server-0 (RTX 4090, 24 GB VRAM):

1. VRAM - measure the VRAM each model actually consumes (e.g. 8,192 MB).
2. Throughput - batch calibration finds the optimum (e.g. 95.3 FPS @ batch 32).
3. Multi-slot - test whether two instances fit at once (max_slots: 2).
4. Memory - time RAM↔VRAM swaps (0.3 s from RAM vs 4.2 s from disk).
5. gRPC - measure round-trip overhead (grpc_overhead: 0.15 s → fanout_min: 48 frames).

The bootstrap profile is stored to MongoDB (cluster_gpu_profiles collection): gpu_name "RTX 4090", total_vram 24,576 MB, max_slots 2, fps 95.3, optimal_batch 32, model_vram 8,192 MB, cold_load 2.1 s, warm_load 0.05 s, ram_to_vram 0.3 s, grpc_overhead 0.15 s, fanout_min 48, plus the batch calibration curve (batch sizes 1-256). The scheduler now knows exactly how to use this GPU - no manual configuration needed.

Model fit check - which models can run? During bootstrap, each model is tested against available VRAM (24 GB total, minus a 512 MB system buffer):
- face_recognition: 8,192 MB model + 3,500 MB batch = 11.7 GB → available
- object_detection: 2,048 MB model + 1,800 MB batch = 3.8 GB → available
- large_language_model: 26,000 MB model exceeds 24 GB VRAM → unavailable
- video_segmentation: 16,000 MB model + 8,000 MB batch = 24 GB with no room for the buffer → unavailable

Unavailable models are excluded from scheduling; the scheduler never routes jobs to GPUs that can't fit the model.
Models That Fit Are Available
Bootstrap loads each model, measures actual VRAM (not estimates), checks that model + batch + buffer all fit. If they do, the model is marked available for scheduling.
Models That Don't Fit Are Excluded
If a model exceeds GPU VRAM or leaves no room for batch activations and the 512MB safety buffer, it's marked unavailable. The scheduler never routes jobs there - no OOM crashes.
Per-GPU, Per-Model Profiles
Different GPUs have different VRAM. An A100 (40GB) may fit a model that an RTX 3090 (24GB) can't. Each GPU gets its own profile with per-model availability stored in MongoDB.
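The fit check behind these availability decisions is a simple inequality: model weights + batch activations + safety buffer must fit in total VRAM. A sketch using the figures from the text (the function name is illustrative; the 512 MB buffer is from the text):

```python
def model_fits(model_mb: int, batch_mb: int, total_vram_mb: int,
               buffer_mb: int = 512) -> bool:
    """Bootstrap fit check: model + batch activations + safety buffer must
    all fit in GPU VRAM, or the model is marked unavailable."""
    return model_mb + batch_mb + buffer_mb <= total_vram_mb

VRAM_24GB = 24_000  # approximate usable MB on a 24 GB card

# face_recognition: 8,192 MB model + 3,500 MB batch → fits, available.
assert model_fits(8_192, 3_500, VRAM_24GB)
# A 26,000 MB LLM exceeds the card entirely → unavailable.
assert not model_fits(26_000, 0, VRAM_24GB)
# video_segmentation: 16,000 + 8,000 MB leaves no room for the buffer → unavailable.
assert not model_fits(16_000, 8_000, VRAM_24GB)
```

A per-GPU profile simply stores the result of this check for each registered model, which is why an A100 (40 GB) and an RTX 3090 (24 GB) can disagree about the same model.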
08 - Model Pool

Three-Tier Memory Hierarchy

Models move between Disk, RAM, and VRAM automatically. Multiple plugins can share the same model via reference counting. In step mode, models swap between RAM and VRAM between pipeline stages.

Memory hierarchy - models automatically move between tiers based on demand:
- VRAM (GPU): ready for inference, fastest, limited capacity (e.g. yolov8-face, ref 2, 800 MB; arcface, ref 1, 600 MB; 14.6 GB free).
- RAM (system memory): warm cache, 10-100x faster reload than disk; models move via to_ram_fn() / from_ram_fn() in ~0.3 s.
- Disk (storage): cold load via load_fn() / unload_fn() in ~4.2 s - slowest, but always available.

Step mode - pipeline swapping (experimental, must be enabled). When VRAM is tight, models swap between stages. Example: yolov8-face (800 MB) + arcface (600 MB) + batch overhead = 2.1 GB, but the GPU has only 1.5 GB free, so both models can't fit at once. Stage 1 (detection) runs yolov8-face in VRAM: frames → detect faces → crop regions (142 crops). Then swap_for_step() unloads yolov8 to RAM and loads arcface into VRAM. Stage 2 (embedding) runs arcface: crops → 512-d embeddings → output (142 embeddings).

Model sharing happens at two levels:
- ModelPool - plugins share weights (ref counted). The same model is loaded once and multiple plugins acquire() it with thread-safe inference: e.g. face_recognition and age_estimation both hold yolov8-face (ref_count: 2, 800 MB). One GPU load, two consumers; release() drops the ref, and the model is evicted at ref=0. Saves VRAM - no duplicate model weights.
- ModelManager - multi-slot (separate instances). The same model type loaded multiple times for concurrent jobs, potentially on different GPUs (e.g. face_rec on GPU-0 and GPU-1, each with its own active_jobs count, plus an idle draining instance). Each job gets its own handler instance; ModelManager routes new work to the least-busy instance (min active_jobs).

Normal mode keeps all models in VRAM simultaneously - fast, but needs enough VRAM for everything. Step mode swaps models between stages - slower, but runs on GPUs with less VRAM.
Disk → RAM → VRAM
Cold load from disk takes ~4s. Warm restore from RAM cache takes ~0.3s. The pool automatically caches models in RAM when they're evicted from VRAM for 10-100x faster reload.
ModelPool - Shared Weights
Multiple plugins that use the same model (e.g. face_recognition and age_estimation both use yolov8-face) share one loaded instance via ref counting. No duplicate VRAM.
ModelManager - Multi-Slot
For concurrent jobs, ModelManager loads separate handler instances - potentially on different GPUs. Each instance tracks active_jobs. New work routes to the least-busy one.
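The ModelPool's ref-counting behavior can be sketched in a few lines. This is an illustrative simplification (the real pool also handles the RAM cache tier and thread safety); `load_fn`/`unload_fn` stand in for actual VRAM load/evict calls:

```python
class ModelPool:
    """Ref-counted model sharing: one loaded instance, many consumers."""

    def __init__(self, load_fn, unload_fn):
        self._load, self._unload = load_fn, unload_fn
        self._models = {}   # name -> (model, ref_count)

    def acquire(self, name):
        """Return the shared model, loading it only on first acquire."""
        if name in self._models:
            model, refs = self._models[name]
            self._models[name] = (model, refs + 1)
        else:
            model = self._load(name)
            self._models[name] = (model, 1)
        return self._models[name][0]

    def release(self, name):
        """Drop one reference; evict the model when the count hits zero."""
        model, refs = self._models[name]
        if refs == 1:
            self._unload(name)
            del self._models[name]
        else:
            self._models[name] = (model, refs - 1)

loads = []
pool = ModelPool(load_fn=lambda n: (loads.append(n), f"<{n}>")[1],
                 unload_fn=lambda n: None)
a = pool.acquire("yolov8-face")   # face_recognition plugin
b = pool.acquire("yolov8-face")   # age_estimation plugin shares it
assert a is b and loads == ["yolov8-face"]   # loaded into VRAM exactly once
```

Two consumers, one load: the second `acquire` only bumps the ref count, which is the VRAM saving the section describes.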
09 - Integrations

Pluggable Inference Backends

Each server can run local GPU plugins or connect to external inference engines. The abstraction layer makes all backends look the same to the scheduler.

Inference server - two-layer backend architecture:

Local GPU plugins - in-process models loaded directly onto the GPU (direct CUDA / ONNX Runtime) via the InferenceHandler interface: load_model(device), infer_batch(images), unload_model(). Examples: face_recognition (YOLO-face + ArcFace, ONNX, 800 MB), object_detection (YOLOv8x, Ultralytics, 2,000 MB), plus your own plugin via @register_handler("type"). The InferenceRegistry auto-discovers plugins on startup, profiles VRAM, and calibrates batches.

External service integrations - remote backends over HTTP/gRPC via the ExternalServiceIntegration interface: connect() / disconnect(), infer(model, image, prompt), list_models() / load_model(). Supported: NVIDIA Triton (ONNX, TensorRT, PyTorch, TF over HTTP/gRPC), vLLM (LLMs + VLMs, OpenAI-compatible API), Ollama (LLaVA, Llama, Mistral, REST API with auto-pull), plus your own backend as an ExternalServiceIntegration subclass. Async HTTP, OpenAI compatibility, Prometheus metrics.

How the scheduler sees it: all backends expose the same interface, so score → claim → route works identically whether a server runs local CUDA plugins on an RTX 4090 (95 FPS, profiled), mixed Triton + local models on an A100 (triton: resnet50, yolov8-trt alongside face_recognition), or vLLM/Ollama LLMs on 4x B6000 96 GB (llama-3.1-70b, llava-v1.6, mistral with auto-pull).
Local GPU Plugins
Subclass InferenceHandler, register with a decorator. Models load directly onto CUDA. Auto-profiled for VRAM, throughput, and batch calibration at startup.
External Backends
Triton (ONNX/TensorRT/PyTorch), vLLM (LLMs via OpenAI API), and Ollama (auto-pull models). All implement the same async interface - connect, infer, list models.
Mix and Match
A single server can run local CUDA plugins AND connect to Triton/vLLM/Ollama simultaneously. The scheduler scores all backends identically - it doesn't care how inference runs.
10 - Leader Election

Nodes Elect a Leader

One node becomes the scheduler leader via Consul distributed locking. The leader handles all scheduling decisions: job assignment, plan adjustments, queue promotion, and health monitoring.

Consul KV store key: service/inference-server/scheduler-leader.

- inference-server-0 (192.168.1.10) - LEADER: kv.put(acquire=session) succeeded. It runs the background workers: _promote_loop (assign queued jobs), _reeval_loop (expand/contract plans), _expiry_loop (fail stale heartbeats).
- inference-server-1 and inference-server-2 (192.168.1.11-12) - followers: the key is locked, so they retry every 10 s and run no background workers.

What happens if the leader fails?
1. The leader's Consul session expires (30 s TTL).
2. The KV key is released automatically (behavior="release").
3. A 5-second lock delay prevents thrashing.
4. The next node acquires the lock (5-15 s total).
5. The new leader starts the background workers.
6. Queued jobs resume promotion immediately.
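The election itself is a single atomic acquire per node. A sketch of the pattern, with an in-memory dict standing in for Consul's KV store (`kv_acquire` here mimics the semantics of Consul's `PUT /v1/kv/<key>?acquire=<session>`, which succeeds for exactly one session):

```python
def try_acquire_leadership(kv_acquire, session_id: str) -> str:
    """Each node attempts an atomic acquire on the shared leader key.
    Exactly one succeeds and starts the background workers; the rest
    become followers and retry on a timer."""
    if kv_acquire("service/inference-server/scheduler-leader", session_id):
        return "leader"    # start _promote_loop, _reeval_loop, _expiry_loop
    return "follower"      # retry every 10 s; run no background workers

# In-memory stand-in for the Consul KV store.
lock = {}
def kv_acquire(key, session):
    if key in lock:
        return False       # key already held by another session
    lock[key] = session
    return True

roles = [try_acquire_leadership(kv_acquire, node)
         for node in ("inference-server-0", "inference-server-1", "inference-server-2")]
# First node to try wins; the others observe the lock and follow.
```

In the real system the session carries the 30 s TTL, so a crashed leader's lock releases automatically and a follower's next retry promotes it.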
11 - Cluster Bootstrap

Node Joins, Cluster Re-calibrates

When a new node joins the cluster, it bootstraps automatically. The cluster then runs network latency probes and recalculates routing parameters across all nodes.

When the new Node C (192.168.1.12:50051) registers with Consul alongside the operational Nodes A and B, a cluster benchmark is triggered: scheduling pauses while each node processes 512 benchmark frames, and latency probing runs between nodes on a 30 s interval. Once calibrated, fanout_min_frames is recalculated, all profiles are marked "operational", and scheduling resumes - the scheduler can now route jobs across all three nodes with accurate performance data.
12 - Job Lifecycle

Submit, Score, Assign, Process

Jobs are intelligently scored against all available GPUs, assigned with fan-out when beneficial, and prioritized when capacity is limited. The scheduler makes data-driven routing decisions.

Lifecycle: Submit (REST POST) → Score candidates (e.g. Server A: 120 FPS ✓, Server B: 80 FPS ✓, Server C: 0 FPS ✗ - weighing VRAM fit, cold penalty, and latency) → Claim servers (atomic MongoDB op, with fan-out if each peer adds ≥ 15% speedup) → Assigned (capacity available) → Active (gRPC streaming) → Complete (slots released). With no capacity, the job is Queued until promoted.

Fan-out decision logic: scored_servers ≥ 2 AND estimated_frames ≥ fanout_min AND queued_jobs == 0 AND each_peer_speedup ≥ 15%. Roles: smart clients get "direct"; dumb clients get "gateway" + "peer".

Priority queue: sorted by priority, then FIFO - REALTIME (priority 0, highest), NORMAL (1), BATCH (2, lowest).

Server scoring breakdown:
1. Runtime metrics (warm) - actual FPS from recent jobs.
2. Bootstrap profile - base FPS from calibration.
3. Batch constraint - reduce if the optimal batch can't fit in available VRAM.
4. VRAM fit check - can model + batch fit? Is eviction needed?
5. Cold load penalty - fps = frames / (frames/fps + cold_load_time).
6. Network latency - reduce effective FPS if the network is the bottleneck.

All decisions are recorded to cluster_scheduler_decisions (7-day TTL).

Model-aware routing: jobs are only sent to servers that have the required model available; servers without it score 0 and are skipped. A 10,000-frame face_recognition job, for example, is assigned to Server A (RTX 4090, has the model, 120 FPS), keeps Server B (RTX 3090, has the model, 80 FPS) as a fan-out candidate, and skips Server C (A100 with only object_detection loaded).
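Two pieces of the scoring breakdown above lend themselves to a direct sketch: the cold-load penalty formula and the model-availability gate. Function and field names are illustrative; the formula is the one quoted in the text:

```python
def effective_fps(frames: int, base_fps: float, cold_load_s: float = 0.0) -> float:
    """Cold-load penalty: fps = frames / (frames / base_fps + cold_load_time).
    A long job amortizes the load; a short job is dominated by it."""
    return frames / (frames / base_fps + cold_load_s)

def score_server(server: dict, model: str, frames: int) -> float:
    """Model-aware scoring sketch: servers without the model score 0 and are
    skipped; warm servers pay no cold-load penalty."""
    if model not in server["models"]:
        return 0.0
    cold = 0.0 if server.get("warm") else server["cold_load_s"]
    return effective_fps(frames, server["fps"], cold)

warm_a100 = {"models": ["face_recognition"], "fps": 120, "cold_load_s": 2.1, "warm": True}
cold_3090 = {"models": ["face_recognition"], "fps": 80, "cold_load_s": 2.1}
no_model = {"models": ["object_detection"], "fps": 300, "cold_load_s": 2.1}

# Server without the model scores 0 regardless of raw speed.
scores = [score_server(s, "face_recognition", 10_000)
          for s in (warm_a100, cold_3090, no_model)]
```

Note how the cold penalty barely dents a 10,000-frame job but would halve the effective rate of a job a few hundred frames long.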
13 - Token Wait

Waiting for a Slot

When all GPUs are busy, your job enters a priority queue. The client library polls automatically, waiting for the scheduler leader to assign a slot and accept the token.

1. All slots busy: submit_job() reaches the scheduler, but both slots are occupied - the response is status: "queued".
2. Waiting for assignment: wait_for_assignment() polls GET /jobs/{id} every 3 seconds while the job sits in the priority queue (e.g. a realtime p:0 job ahead of your p:1 job, with a p:2 batch job behind).
3. A slot frees up: a previous job completes and releases its slot; the leader's _promote_queued_jobs() atomically claims the slot and assigns the job.
4. The next poll returns {status: "assigned"} with servers, token_id, and gateway. The client exits the wait loop → connect(job) → gRPC stream with the token → infer_batch(frames) → complete().

Total wait time depends on cluster load and job priority; realtime (p:0) jobs promote first.
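The "priority first, then FIFO" ordering is exactly what a heap keyed on (priority, arrival order) gives you. A minimal sketch of the promotion order (class and method names are illustrative):

```python
import heapq
import itertools

class JobQueue:
    """Priority queue sketch: lower priority number wins; ties are FIFO."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # arrival counter breaks priority ties

    def push(self, job_id: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), job_id))

    def promote(self):
        """Pop the next job to assign, or None if the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None

q = JobQueue()
q.push("batch-job", 2)    # BATCH    (lowest)
q.push("your-job", 1)     # NORMAL
q.push("live-job", 0)     # REALTIME (highest)
order = [q.promote(), q.promote(), q.promote()]
```

Even though the realtime job arrived last, it promotes first; equal-priority jobs would come out in arrival order.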
14 - Dynamic Plans

Plans Adjust When Jobs Arrive

When a new job arrives and the cluster is busy, the scheduler contracts existing plans - the running client drops a connection so the new job can be served. When capacity frees up, plans expand back.

1. Steady state: Job A uses three servers via direct fan-out (weights 48% / 32% / 20%).
2. A new Job B arrives: with queued > 0, the scheduler contracts the plan and yields the weakest peer (Server 3). Client A detects the plan version change, closes that connection, and rebalances to 60% / 40%; Job B runs on Server 3.
3. Job B completes: with remaining frames ≥ fanout_min, the plan expands back and Server 3 returns to Job A (48% / 32% / 20% again).

How it works under the hood: the scheduler does a $pull of the weakest server from servers[] and an $inc on version; the client detects version > plan_version, closes the stale connection, and rebalances its weights.
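The client side of that contraction is a reconciliation against the new plan. A sketch, with plain dicts standing in for real gRPC connections and the MongoDB plan document (names are illustrative):

```python
def reconcile_plan(connections: dict, plan: dict):
    """When the scheduler bumps the plan version after pulling a server out
    of servers[], the client keeps only connections still in the plan and
    closes the stale ones."""
    live = {s: c for s, c in connections.items() if s in plan["servers"]}
    closed = sorted(set(connections) - set(live))
    return live, closed

# Job A held three connections; the scheduler yielded Server 3 to Job B.
conns = {"srv-1": "conn-1", "srv-2": "conn-2", "srv-3": "conn-3"}
plan = {"version": 2, "servers": ["srv-1", "srv-2"]}
live, closed = reconcile_plan(conns, plan)
```

After reconciling, the client re-runs its weight update over the surviving connections, which is how the 48/32/20 split becomes 60/40.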
15 - Dynamic Scaling

Nodes Join and Leave. Zero Downtime.

Add GPU nodes anytime - they bootstrap, benchmark, and start serving. If a node goes down, the system adapts instantly: connections reroute, plans update, processing continues.

Running: two nodes (Node A with 2 GPUs, Node B with 1) serve an active job. Node joins: Node C bootstraps, calibrates, and the client opens a new connection as its plan expands. Node leaves: the node's Consul TTL expires, the client closes that connection, and traffic reroutes to the remaining nodes. Zero downtime - failed frames are retried on healthy servers and weights rebalance automatically.
16 - Command Center

Monitor Everything

The Command Center dashboard shows real-time cluster health, GPU utilization, VRAM allocation, job status, throughput metrics, network latency, and batch calibration - all in one view.

Dashboard snapshot: 3/3 healthy nodes, 7 GPUs, 109.3 / 456 GB VRAM used, active models face_recognition, object_detection, and video_seg. Per-node cards show GPU utilization, per-model VRAM allocation, frames processed, uptime, and active jobs - from inference-server-1 (a single RTX 3090, 10.0 / 24.0 GB) up to inference-server-2 (4x B6000 96 GB, 4 slots, 85.2 / 384.0 GB). Panels cover throughput (frames/sec), batch calibration curves (FPS vs batch size, optimal = 32 on an RTX 4090), a network latency matrix (8-15 ms between servers), scheduler slots (5 total, 3 active, 0 queued, 2 free), active jobs with progress, and cluster topology (2 clients, 3 nodes, 5 GPUs). Real-time WebSocket updates, Prometheus metrics, GPU profiling, and a scheduler decisions audit trail.
Integrations

External System Integrations

Connect the inference platform to your existing workflows - media asset management, cloud storage, file systems, and more. Coming soon.

Iconik
Iconik
Media Asset Management
Automatically process assets as they're ingested - face recognition, object detection, metadata enrichment. Results written back as Iconik metadata.
CatDV
CatDV
Catalog & Asset Management
Run inference on cataloged media and write results back as metadata, markers, and subclips. Trigger on catalog events.
S3 AWS
Amazon S3
Cloud Object Storage
Watch S3 buckets for new objects. Trigger inference jobs automatically on upload. Write results alongside source files or to a separate output bucket.
File System
Local & Network Storage
Watch directories for new files. Process images and video frames as they land on shared storage, NAS, or local disk. Results written to configurable output paths.
GCS Google
Google Cloud Storage
Cloud Object Storage
Watch GCS buckets for new objects via Pub/Sub notifications. Trigger inference on upload. Write results to the same or different bucket with configurable prefixes.
+
Custom Integration
Build Your Own
Implement the IntegrationHandler interface to connect any storage system. Register a URI scheme and the platform handles the rest - routing, inference, and result delivery.
Integration handler architecture - integrations register URI handlers, and client services use URIs to access assets through any backend.

Step 1 - integrations register handlers on install. The integration registry maps schemes to handlers via register_handler(scheme, handler): iconik:// → IconikHandler, catdv:// → CatDVHandler, s3:// → S3Handler, file:// → FileHandler, your:// → YourHandler.

Step 2 - a client service processes an asset via URI. get_asset("iconik://asset/abc-123") resolves the iconik:// scheme to its handler, which calls fetch_asset(asset_id: "abc-123") against the Iconik API (GET /v1/assets/abc-123/proxies), downloads the proxy file, and returns {frames, metadata}. (With a shared mount, the handler can instead save the file and return a path like /mnt/shared/abc-123.mp4 that the client reads directly.) The client then runs submit_job("face_recognition") → infer_batch(frames), gets results (detections, embeddings, ...), and write_results("iconik://asset/abc-123", results) updates the asset's metadata.

Adding a new integration is just registering a handler:

class MyStorageHandler implements IntegrationHandler:
    scheme = "mystorage"            // handles mystorage://... URIs
    fetch_asset(uri) → frames       // download & decode
    write_results(uri, results)     // write back

Register once → works with all client services, all inference types, and all GPU servers.
Engineering

Roadmap

Planned enhancements to the inference platform - smarter scheduling policies, cost-aware routing, and secure remote inference.

Time-Based Schedule Policies
Scheduler Enhancement
Allocate GPU slots based on time windows. Full utilization overnight for batch processing, reduced capacity during business hours for interactive workloads. Define policies per GPU, per node, or cluster-wide.
overnight: 100% slots daytime: 50% slots weekends: 75% slots
$
Cost-Based Schedule Policies
Scheduler Enhancement
Prefer cheaper processing paths. Local and intranet GPUs are free - the scheduler favors them. Remote cloud GPUs factor in bandwidth and compute costs. Jobs route to the most cost-effective GPU that meets the performance requirement.
local GPU: $0/hr intranet GPU: $0/hr cloud GPU: $2.50/hr + bandwidth
gRPC Streaming over TLS
Security & Remote Inference
Encrypted gRPC streams for off-site and remote GPU inference. Run inference across data centers, cloud regions, or edge locations with mutual TLS authentication. Extends the cluster beyond the local network securely.
mTLS authentication cross-datacenter edge inference
Throughput Priority Scheduling
Scheduler Enhancement
Live workloads - natural language search, live video - get the fastest GPUs with lowest latency. Background tasks automatically yield to higher-powered GPUs when live inference needs them.
live: fastest GPU + lowest latency background: any available GPU
Throughput priority - live vs background:

1. A background job uses all GPUs: Server 1 (2x RTX 6000, 48 GB each, 280 FPS) and Server 2 (RTX 4090, 24 GB, 95 FPS) both run the 50k-frame batch_process job via fan-out (375 FPS total).
2. A live video face_recognition job requests a plan with priority=high, capacity=60%+ of the cluster, latency=realtime. All slots are occupied, so the scheduler scores the GPUs (RTX 6000: 280 FPS at 3 ms ✓ best; RTX 4090: 95 FPS, doesn't meet the 60% requirement) and contracts the background job to free the RTX 6000s.
3. The background job yields: live face_recognition runs at 280 FPS on the 2x RTX 6000 while the batch job continues, contracted, on the RTX 4090 at 95 FPS - uninterrupted.
4. The live job completes and the background job expands back to all GPUs.

The cycle: background uses all GPUs → live job queued → scheduler yields the fastest GPU → live runs at full speed → live completes → background expands back. Zero downtime.
Inception

How It All Started

Demo

See It in Action

Videos and screenshots of the inference platform running.

Installation Demo

Installation

Get Started in One Command

Install the full inference platform - GPU servers, scheduler, service discovery, monitoring - with a single command. The interactive installer handles everything.

Run this in your terminal:
curl -fsSL "https://llamatron.ai/install.sh" | sh
Requires: at least one NVIDIA GPU.
llamatron-installer