GPU Inference Platform

Inference Abstraction

Focus on your business logic. The platform handles GPU resource planning, automatic model profiling, dynamic scaling, and intelligent job routing - even with a single GPU.

Automatic GPU Profiling
Dynamic Scaling
Smart & Simple Modes
Priority Job Routing

Integrates with your existing workflows, asset management systems, and storage

01 - The Abstraction

You Write Business Logic. The Network Handles the Rest.

Your application talks to one simple client. Behind the scenes, the inference network manages resources, routes jobs, profiles hardware, and scales automatically.

Diagram - the abstraction boundary:

Your application:
  client = create_client()
  job = submit_job(type, frames)
  connect(job)
  results = infer_batch(frames)
  complete()
4 lines. That's it. Results returned.

Everything past the InferenceClient boundary, the inference network handles:
- Platform services: Consul discovery, MongoDB state, Prometheus metrics, Keycloak auth.
- GPU servers (e.g. 192.168.1.10-12:50051, one or more GPUs each), with each server running: Scheduler (leader-elected), VRAM Manager, Adaptive Scaler, Batch Collector, Auto Profiler.
- Automatic profiling + dynamic scaling + intelligent routing + VRAM management + health monitoring.
4 Lines of Code
Create client, submit job, connect, infer. The entire GPU cluster is abstracted behind one simple interface.
Zero Configuration
No manual GPU sizing, no throughput tuning, no batch size optimization. It's all profiled automatically at startup.
Works with 1 GPU
Even a single GPU benefits from automatic profiling, job scheduling, VRAM management, and model caching.
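The four-call flow above can be sketched end to end. This is a hypothetical stand-in for the real client library (the actual class and method names may differ); the stub methods only mark where the REST and gRPC calls would happen.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_type: str
    frame_count: int
    token_id: str = "stub-token"    # real client: UUID issued by the scheduler

class InferenceClient:
    """Illustrative stand-in for the client library, not the real API."""

    def submit_job(self, job_type: str, frames: list) -> Job:
        # Real client: POST /cluster/jobs/submit; may wait in the priority queue.
        return Job(job_type, len(frames))

    def connect(self, job: Job) -> None:
        # Real client: auto-selects simple/cluster mode, opens gRPC streams
        # with the scheduler-token header, starts the 10 s heartbeat.
        pass

    def infer_batch(self, job: Job, frames: list) -> list:
        # Real client: streams frames over gRPC and yields results.
        return [{"frame": i, "detections": []} for i, _ in enumerate(frames)]

    def complete(self, job: Job) -> None:
        # Real client: POST /cluster/jobs/{id}/complete releases the GPU slots.
        pass

client = InferenceClient()
frames = ["frame-0", "frame-1", "frame-2"]
job = client.submit_job("face_recognition", frames)
client.connect(job)
results = client.infer_batch(job, frames)
client.complete(job)
```

Everything below `submit_job` is the platform's responsibility; the application only ever sees this surface.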
Automatic Model Swapping - More Models Than VRAM

Three registered models need 36 GB total (face_rec 8 GB, obj_detect 12 GB, video_seg 16 GB), but the GPU only has 24 GB. The server swaps automatically as jobs arrive:

- Job A (face_recognition): face_rec loads (8 GB), 14.5 GB free, processing 10k frames.
- Job B (object_detection) arrives: obj_detect (12 GB) fits alongside - both loaded, 2.5 GB free.
- Job C (video_segmentation, 16 GB) arrives: it won't fit, so face_rec and obj_detect are evicted to the RAM cache and video_segmentation loads (6.5 GB free) and processes.
- The RAM cache keeps evicted models warm: reload to VRAM in ~0.3 s versus ~4.2 s cold from disk.
02 - Handshake

Client-Server Handshake

The full sequence from job submission to inference streaming. The client library handles all of this - your app just calls submit, connect, infer.

Sequence (your app code → InferenceClient library → Scheduler → GPU servers):

1. submit_job() → POST /cluster/jobs/submit. The scheduler scores GPUs, claims slots, creates a plan, and generates a token_id; it returns {status, token_id, servers[]}.
2. If status == "queued" (no capacity): the client polls GET /cluster/jobs/{id} every 3 s. When a slot frees, the leader promotes the job and the poll returns {status: "assigned", servers[], token}.
3. connect(job): the client auto-selects a mode and opens a gRPC connection with a scheduler-token: {token_id} header; the server validates the token. In cluster mode, the client connects to every assigned server.
4. A heartbeat loop runs every 10 s for the life of the connection.
5. infer_batch(frames): gRPC InferStream sends frames; the server marks the job active (mark_active(job_id)) and streams results back.
6. complete(): POST /cluster/jobs/{id}/complete releases the slots and sets status=done.

Transports: app code → client library is an in-process function call; client library → scheduler is REST (async httpx) over the network; client library → GPU server is a bidirectional gRPC stream.
Token-Based Auth
Each job gets a unique UUID token. The client passes it in gRPC metadata. The server validates it against MongoDB before processing frames.
Queued Jobs Wait
If no GPU capacity is available, the job enters a priority queue. The client polls every 3 seconds. The scheduler leader promotes jobs as slots free up.
Heartbeat Keep-alive
Once connected, a background heartbeat runs every 10 seconds. If it stops, the scheduler expires the job and releases its GPU slots for other jobs.
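The queued-job wait described above is a simple poll loop. A minimal sketch, with a `fetch_status` callable standing in for the real `GET /jobs/{id}` request (the function and field names here are illustrative):

```python
import time

def wait_for_assignment(fetch_status, poll_interval=3.0, sleep=time.sleep):
    """Poll the scheduler until the job leaves the queue.

    fetch_status: callable returning a dict like {"status": "queued"} or
    {"status": "assigned", "servers": [...], "token_id": "..."}.
    """
    while True:
        state = fetch_status()
        if state["status"] == "assigned":
            return state
        sleep(poll_interval)    # real client waits 3 s between polls

# Simulated scheduler responses: queued twice, then assigned.
responses = iter([
    {"status": "queued"},
    {"status": "queued"},
    {"status": "assigned", "servers": ["192.168.1.10:50051"], "token_id": "abc"},
])
state = wait_for_assignment(lambda: next(responses), sleep=lambda _: None)
```

Once `wait_for_assignment` returns, the client has the server list and token it needs to open the gRPC stream.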
03 - Client Modes

Simple Mode vs Cluster Mode

Simple mode sends everything through one gateway - the server fans out internally. Cluster mode connects directly to all servers and distributes frames by measured throughput.

Simple Mode - "Send to one server. It handles the rest." Your app opens one gRPC connection to a gateway server (e.g. 192.168.1.10:50051); the gateway fans out to peer servers internally.

Cluster Mode - "Connect directly. Distribute by throughput." Your app connects to every server and splits frames by measured throughput - e.g. 120 FPS → 60%, 50 FPS → 25%, 30 FPS → 15% - for maximum throughput.

Auto Mode - selects the best mode automatically: servers > 1 AND role == "direct" ? ClusterMode : SimpleMode.
Simple Mode (Gateway)
1 gRPC connection to the gateway server. The gateway receives all frames and fans out to peers internally. Low complexity, reliable, with automatic retry and reconnection.
Cluster Mode (Direct)
N direct gRPC connections. Client distributes frames weighted by measured throughput (EMA-smoothed). Weights adapt dynamically as server performance changes.
Auto Mode
Detects cluster size and server roles automatically. Picks Simple for single-server plans, Cluster for multi-server with direct roles.
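The auto-mode rule quoted in the diagram is a one-line decision. A sketch of that rule (the function name is illustrative):

```python
def select_mode(servers: list, role: str) -> str:
    """Auto mode: multi-server plan with a "direct" role → cluster mode;
    anything else (single server, or gateway role) → simple mode."""
    return "cluster" if len(servers) > 1 and role == "direct" else "simple"

# Multi-server plan with direct connections → cluster mode.
assert select_mode(["srv-1", "srv-2", "srv-3"], "direct") == "cluster"
# Single-server plan → simple mode, regardless of role.
assert select_mode(["srv-1"], "direct") == "simple"
# Multi-server plan routed through a gateway → simple mode.
assert select_mode(["srv-1", "srv-2"], "gateway") == "simple"
```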
04 - Topologies

From One GPU to Many

Works with any topology. Single node with one GPU, multi-GPU nodes, or a full cluster. The system adapts to whatever hardware you have.

Single node (1 node, 1 GPU): a client in simple mode streams over gRPC to one server (e.g. inference-server-0 at 192.168.1.10:50051 with an RTX 4090 running face_recognition at 95.3 FPS across 10,000 frames). It still gets auto profiling, job scheduling, VRAM management, and model caching.

Two nodes (2 nodes, 3 GPUs, mixed): a cluster-mode client weights Node A (2x RTX 4090, 2 slots, 120+115 FPS multi-GPU parallel) at 65% and Node B (1x RTX 3090, 1 slot, 80 FPS) at 35%.

Full cluster (4 nodes, 9 GPUs, mixed): RTX 4090s, an RTX 3090, and 4x B6000 96 GB (4 slots, 384 GB), discovered via Consul, serving multiple apps in simple, cluster, and auto modes at once. Scales horizontally - add nodes anytime.
05 - Gateway Fan-out

Frames Through the Gateway

In simple mode, your client sends frames to one server. The gateway batches them and processes locally. If scoring indicates it's beneficial, it fans out to peers - otherwise it handles everything itself.

The client buffers frames (encode + resize) and streams them over gRPC with the scheduler-token header. The gateway server (e.g. 192.168.1.10:50051) queues batches (max=16 / 50 ms), processes on its own GPU, and may fan out to peers - e.g. 60% of a batch to one peer, 40% to another. Results stream back to the client; a 10 s heartbeat, retry on failure, and auto reconnect keep the session healthy.

Fan-out to peers only happens when scoring indicates it's beneficial: estimated_frames ≥ fanout_min, each peer adds ≥ 15% speedup, and no jobs are queued. If the job is small or network overhead outweighs the gain, the gateway processes everything locally.
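The fan-out conditions above combine into a single predicate. A sketch of that decision (function and parameter names are illustrative; the 15% threshold and `fanout_min` come from the text):

```python
def should_fan_out(estimated_frames: int, fanout_min: int, queued_jobs: int,
                   peer_speedups: list, min_speedup: float = 0.15) -> bool:
    """Gateway fan-out rule: split a job across peers only when the job is
    large enough, the queue is empty, and every peer adds a real speedup."""
    return (estimated_frames >= fanout_min
            and queued_jobs == 0
            and len(peer_speedups) > 0
            and all(s >= min_speedup for s in peer_speedups))

# Large job, empty queue, both peers add >= 15%: fan out.
assert should_fan_out(10_000, 48, 0, [0.40, 0.25])
# Small job: process locally.
assert not should_fan_out(20, 48, 0, [0.40, 0.25])
# Queued jobs waiting: keep peers free, process locally.
assert not should_fan_out(10_000, 48, 1, [0.40, 0.25])
# A peer that adds under 15%: not worth the network overhead.
assert not should_fan_out(10_000, 48, 0, [0.40, 0.05])
```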
06 - Direct Fan-out

Smart Client Multi-Connect

In cluster mode, the client connects directly to all servers, distributes frames by measured throughput, and adapts weights dynamically using EMA smoothing.

The smart client's _distribute_frames() splits each batch by weight - e.g. Server 1 (2x GPU, 120 FPS): 12 frames (60%), Server 2 (50 FPS): 5 frames (25%), Server 3 (30 FPS): 3 frames (15%). Weights update every batch via EMA: w = 0.3 * observed + 0.7 * prev. All three servers process simultaneously; results are aggregated by original frame index (200 FPS combined in this example), and _refresh_plan() runs on each batch.
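The EMA weight update and proportional split can be sketched directly from the numbers in the diagram. Function names here are illustrative, but the formula (w = 0.3·observed + 0.7·prev) and the 60/25/15 split are from the text:

```python
def update_weights(prev: dict, observed_fps: dict, alpha: float = 0.3) -> dict:
    """EMA-smooth per-server weights from measured throughput, then normalize.
    A server with no history starts at its observed throughput."""
    smoothed = {s: alpha * fps + (1 - alpha) * prev.get(s, fps)
                for s, fps in observed_fps.items()}
    total = sum(smoothed.values())
    return {s: v / total for s, v in smoothed.items()}

def distribute_frames(frames: list, weights: dict) -> dict:
    """Split one batch across servers proportionally to their weights."""
    n = len(frames)
    counts = {s: int(n * w) for s, w in weights.items()}
    # Rounding leftovers go to the fastest server.
    counts[max(weights, key=weights.get)] += n - sum(counts.values())
    out, i = {}, 0
    for s, c in counts.items():
        out[s] = frames[i:i + c]
        i += c
    return out

# 120 / 50 / 30 FPS → 60% / 25% / 15%, so a 20-frame batch splits 12 / 5 / 3.
weights = update_weights({}, {"srv-1": 120, "srv-2": 50, "srv-3": 30})
split = distribute_frames(list(range(20)), weights)
```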
07 - Auto Profiling

GPU Profiling Happens Automatically

When a server starts, it profiles every model: VRAM usage, throughput at each batch size, multi-slot capability, memory hierarchy. No manual configuration needed.

Bootstrap on inference-server-0 (RTX 4090, 24 GB VRAM):

1. VRAM - measure the VRAM each model actually consumes (e.g. 8,192 MB).
2. Throughput - batch calibration finds the optimum (e.g. 95.3 FPS @ batch 32).
3. Multi-slot - test whether two instances fit at once (max_slots: 2).
4. Memory - time RAM↔VRAM swaps (0.3 s from RAM vs 4.2 s from disk).
5. gRPC - measure round-trip overhead (grpc_overhead: 0.15 s → fanout_min: 48 frames).

The bootstrap profile is stored to MongoDB (cluster_gpu_profiles collection): gpu_name "RTX 4090", total_vram 24,576 MB, max_slots 2, fps 95.3, optimal_batch 32, model_vram 8,192 MB, cold_load 2.1 s, warm_load 0.05 s, ram_to_vram 0.3 s, grpc_overhead 0.15 s, fanout_min 48, plus the batch calibration curve (batch sizes 1-256). The scheduler now knows exactly how to use this GPU - no manual configuration needed.

Model fit check - which models can run? During bootstrap, each model is tested against available VRAM (24 GB total, minus a 512 MB system buffer):
- face_recognition: 8,192 MB model + 3,500 MB batch = 11.7 GB → available
- object_detection: 2,048 MB model + 1,800 MB batch = 3.8 GB → available
- large_language_model: 26,000 MB model exceeds 24 GB VRAM → unavailable
- video_segmentation: 16,000 MB model + 8,000 MB batch = 24 GB with no room for the buffer → unavailable

Unavailable models are excluded from scheduling; the scheduler never routes jobs to GPUs that can't fit the model.
Models That Fit Are Available
Bootstrap loads each model, measures actual VRAM (not estimates), checks that model + batch + buffer all fit. If they do, the model is marked available for scheduling.
Models That Don't Fit Are Excluded
If a model exceeds GPU VRAM or leaves no room for batch activations and the 512MB safety buffer, it's marked unavailable. The scheduler never routes jobs there - no OOM crashes.
Per-GPU, Per-Model Profiles
Different GPUs have different VRAM. An A100 (40GB) may fit a model that an RTX 3090 (24GB) can't. Each GPU gets its own profile with per-model availability stored in MongoDB.
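The fit check behind these availability decisions is a simple inequality: model weights + batch activations + safety buffer must fit in total VRAM. A sketch using the figures from the text (the function name is illustrative; the 512 MB buffer is from the text):

```python
def model_fits(model_mb: int, batch_mb: int, total_vram_mb: int,
               buffer_mb: int = 512) -> bool:
    """Bootstrap fit check: model + batch activations + safety buffer must
    all fit in GPU VRAM, or the model is marked unavailable."""
    return model_mb + batch_mb + buffer_mb <= total_vram_mb

VRAM_24GB = 24_000  # approximate usable MB on a 24 GB card

# face_recognition: 8,192 MB model + 3,500 MB batch → fits, available.
assert model_fits(8_192, 3_500, VRAM_24GB)
# A 26,000 MB LLM exceeds the card entirely → unavailable.
assert not model_fits(26_000, 0, VRAM_24GB)
# video_segmentation: 16,000 + 8,000 MB leaves no room for the buffer → unavailable.
assert not model_fits(16_000, 8_000, VRAM_24GB)
```

A per-GPU profile simply stores the result of this check for each registered model, which is why an A100 (40 GB) and an RTX 3090 (24 GB) can disagree about the same model.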
08 - Model Pool

Three-Tier Memory Hierarchy

Models move between Disk, RAM, and VRAM automatically. Multiple plugins can share the same model via reference counting. In step mode, models swap between RAM and VRAM between pipeline stages.

Memory hierarchy - models automatically move between tiers based on demand:
- VRAM (GPU): ready for inference, fastest, limited capacity (e.g. yolov8-face, ref 2, 800 MB; arcface, ref 1, 600 MB; 14.6 GB free).
- RAM (system memory): warm cache, 10-100x faster reload than disk; models move via to_ram_fn() / from_ram_fn() in ~0.3 s.
- Disk (storage): cold load via load_fn() / unload_fn() in ~4.2 s - slowest, but always available.

Step mode - pipeline swapping (experimental, must be enabled). When VRAM is tight, models swap between stages. Example: yolov8-face (800 MB) + arcface (600 MB) + batch overhead = 2.1 GB, but the GPU has only 1.5 GB free, so both models can't fit at once. Stage 1 (detection) runs yolov8-face in VRAM: frames → detect faces → crop regions (142 crops). Then swap_for_step() unloads yolov8 to RAM and loads arcface into VRAM. Stage 2 (embedding) runs arcface: crops → 512-d embeddings → output (142 embeddings).

Model sharing happens at two levels:
- ModelPool - plugins share weights (ref counted). The same model is loaded once and multiple plugins acquire() it with thread-safe inference: e.g. face_recognition and age_estimation both hold yolov8-face (ref_count: 2, 800 MB). One GPU load, two consumers; release() drops the ref, and the model is evicted at ref=0. Saves VRAM - no duplicate model weights.
- ModelManager - multi-slot (separate instances). The same model type loaded multiple times for concurrent jobs, potentially on different GPUs (e.g. face_rec on GPU-0 and GPU-1, each with its own active_jobs count, plus an idle draining instance). Each job gets its own handler instance; ModelManager routes new work to the least-busy instance (min active_jobs).

Normal mode keeps all models in VRAM simultaneously - fast, but needs enough VRAM for everything. Step mode swaps models between stages - slower, but runs on GPUs with less VRAM.
Disk → RAM → VRAM
Cold load from disk takes ~4s. Warm restore from RAM cache takes ~0.3s. The pool automatically caches models in RAM when they're evicted from VRAM for 10-100x faster reload.
ModelPool - Shared Weights
Multiple plugins that use the same model (e.g. face_recognition and age_estimation both use yolov8-face) share one loaded instance via ref counting. No duplicate VRAM.
ModelManager - Multi-Slot
For concurrent jobs, ModelManager loads separate handler instances - potentially on different GPUs. Each instance tracks active_jobs. New work routes to the least-busy one.
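The ModelPool's ref-counting behavior can be sketched in a few lines. This is an illustrative simplification (the real pool also handles the RAM cache tier and thread safety); `load_fn`/`unload_fn` stand in for actual VRAM load/evict calls:

```python
class ModelPool:
    """Ref-counted model sharing: one loaded instance, many consumers."""

    def __init__(self, load_fn, unload_fn):
        self._load, self._unload = load_fn, unload_fn
        self._models = {}   # name -> (model, ref_count)

    def acquire(self, name):
        """Return the shared model, loading it only on first acquire."""
        if name in self._models:
            model, refs = self._models[name]
            self._models[name] = (model, refs + 1)
        else:
            model = self._load(name)
            self._models[name] = (model, 1)
        return self._models[name][0]

    def release(self, name):
        """Drop one reference; evict the model when the count hits zero."""
        model, refs = self._models[name]
        if refs == 1:
            self._unload(name)
            del self._models[name]
        else:
            self._models[name] = (model, refs - 1)

loads = []
pool = ModelPool(load_fn=lambda n: (loads.append(n), f"<{n}>")[1],
                 unload_fn=lambda n: None)
a = pool.acquire("yolov8-face")   # face_recognition plugin
b = pool.acquire("yolov8-face")   # age_estimation plugin shares it
assert a is b and loads == ["yolov8-face"]   # loaded into VRAM exactly once
```

Two consumers, one load: the second `acquire` only bumps the ref count, which is the VRAM saving the section describes.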
09 - Integrations

Pluggable Inference Backends

Each server can run local GPU plugins or connect to external inference engines. The abstraction layer makes all backends look the same to the scheduler.

Inference server - two-layer backend architecture:

Local GPU plugins - in-process models loaded directly onto the GPU (direct CUDA / ONNX Runtime) via the InferenceHandler interface: load_model(device), infer_batch(images), unload_model(). Examples: face_recognition (YOLO-face + ArcFace, ONNX, 800 MB), object_detection (YOLOv8x, Ultralytics, 2,000 MB), plus your own plugin via @register_handler("type"). The InferenceRegistry auto-discovers plugins on startup, profiles VRAM, and calibrates batches.

External service integrations - remote backends over HTTP/gRPC via the ExternalServiceIntegration interface: connect() / disconnect(), infer(model, image, prompt), list_models() / load_model(). Supported: NVIDIA Triton (ONNX, TensorRT, PyTorch, TF over HTTP/gRPC), vLLM (LLMs + VLMs, OpenAI-compatible API), Ollama (LLaVA, Llama, Mistral, REST API with auto-pull), plus your own backend as an ExternalServiceIntegration subclass. Async HTTP, OpenAI compatibility, Prometheus metrics.

How the scheduler sees it: all backends expose the same interface, so score → claim → route works identically whether a server runs local CUDA plugins on an RTX 4090 (95 FPS, profiled), mixed Triton + local models on an A100 (triton: resnet50, yolov8-trt alongside face_recognition), or vLLM/Ollama LLMs on 4x B6000 96 GB (llama-3.1-70b, llava-v1.6, mistral with auto-pull).
Local GPU Plugins
Subclass InferenceHandler, register with a decorator. Models load directly onto CUDA. Auto-profiled for VRAM, throughput, and batch calibration at startup.
External Backends
Triton (ONNX/TensorRT/PyTorch), vLLM (LLMs via OpenAI API), and Ollama (auto-pull models). All implement the same async interface - connect, infer, list models.
Mix and Match
A single server can run local CUDA plugins AND connect to Triton/vLLM/Ollama simultaneously. The scheduler scores all backends identically - it doesn't care how inference runs.
10 - Leader Election

Nodes Elect a Leader

One node becomes the scheduler leader via Consul distributed locking. The leader handles all scheduling decisions: job assignment, plan adjustments, queue promotion, and health monitoring.

Consul KV store key: service/inference-server/scheduler-leader.

- inference-server-0 (192.168.1.10) - LEADER: kv.put(acquire=session) succeeded. It runs the background workers: _promote_loop (assign queued jobs), _reeval_loop (expand/contract plans), _expiry_loop (fail stale heartbeats).
- inference-server-1 and inference-server-2 (192.168.1.11-12) - followers: the key is locked, so they retry every 10 s and run no background workers.

What happens if the leader fails?
1. The leader's Consul session expires (30 s TTL).
2. The KV key is released automatically (behavior="release").
3. A 5-second lock delay prevents thrashing.
4. The next node acquires the lock (5-15 s total).
5. The new leader starts the background workers.
6. Queued jobs resume promotion immediately.
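The election itself is a single atomic acquire per node. A sketch of the pattern, with an in-memory dict standing in for Consul's KV store (`kv_acquire` here mimics the semantics of Consul's `PUT /v1/kv/<key>?acquire=<session>`, which succeeds for exactly one session):

```python
def try_acquire_leadership(kv_acquire, session_id: str) -> str:
    """Each node attempts an atomic acquire on the shared leader key.
    Exactly one succeeds and starts the background workers; the rest
    become followers and retry on a timer."""
    if kv_acquire("service/inference-server/scheduler-leader", session_id):
        return "leader"    # start _promote_loop, _reeval_loop, _expiry_loop
    return "follower"      # retry every 10 s; run no background workers

# In-memory stand-in for the Consul KV store.
lock = {}
def kv_acquire(key, session):
    if key in lock:
        return False       # key already held by another session
    lock[key] = session
    return True

roles = [try_acquire_leadership(kv_acquire, node)
         for node in ("inference-server-0", "inference-server-1", "inference-server-2")]
# First node to try wins; the others observe the lock and follow.
```

In the real system the session carries the 30 s TTL, so a crashed leader's lock releases automatically and a follower's next retry promotes it.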
11 - Cluster Bootstrap

Node Joins, Cluster Re-calibrates

When a new node joins the cluster, it bootstraps automatically. The cluster then runs network latency probes and recalculates routing parameters across all nodes.

When the new Node C (192.168.1.12:50051) registers with Consul alongside the operational Nodes A and B, a cluster benchmark is triggered: scheduling pauses while each node processes 512 benchmark frames, and latency probing runs between nodes on a 30 s interval. Once calibrated, fanout_min_frames is recalculated, all profiles are marked "operational", and scheduling resumes - the scheduler can now route jobs across all three nodes with accurate performance data.
12 - Job Lifecycle

Submit, Score, Assign, Process

Jobs are intelligently scored against all available GPUs, assigned with fan-out when beneficial, and prioritized when capacity is limited. The scheduler makes data-driven routing decisions.

Lifecycle: Submit (REST POST) → Score candidates (e.g. Server A: 120 FPS ✓, Server B: 80 FPS ✓, Server C: 0 FPS ✗ - weighing VRAM fit, cold penalty, and latency) → Claim servers (atomic MongoDB op, with fan-out if each peer adds ≥ 15% speedup) → Assigned (capacity available) → Active (gRPC streaming) → Complete (slots released). With no capacity, the job is Queued until promoted.

Fan-out decision logic: scored_servers ≥ 2 AND estimated_frames ≥ fanout_min AND queued_jobs == 0 AND each_peer_speedup ≥ 15%. Roles: smart clients get "direct"; dumb clients get "gateway" + "peer".

Priority queue: sorted by priority, then FIFO - REALTIME (priority 0, highest), NORMAL (1), BATCH (2, lowest).

Server scoring breakdown:
1. Runtime metrics (warm) - actual FPS from recent jobs.
2. Bootstrap profile - base FPS from calibration.
3. Batch constraint - reduce if the optimal batch can't fit in available VRAM.
4. VRAM fit check - can model + batch fit? Is eviction needed?
5. Cold load penalty - fps = frames / (frames/fps + cold_load_time).
6. Network latency - reduce effective FPS if the network is the bottleneck.

All decisions are recorded to cluster_scheduler_decisions (7-day TTL).

Model-aware routing: jobs are only sent to servers that have the required model available; servers without it score 0 and are skipped. A 10,000-frame face_recognition job, for example, is assigned to Server A (RTX 4090, has the model, 120 FPS), keeps Server B (RTX 3090, has the model, 80 FPS) as a fan-out candidate, and skips Server C (A100 with only object_detection loaded).
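Two pieces of the scoring breakdown above lend themselves to a direct sketch: the cold-load penalty formula and the model-availability gate. Function and field names are illustrative; the formula is the one quoted in the text:

```python
def effective_fps(frames: int, base_fps: float, cold_load_s: float = 0.0) -> float:
    """Cold-load penalty: fps = frames / (frames / base_fps + cold_load_time).
    A long job amortizes the load; a short job is dominated by it."""
    return frames / (frames / base_fps + cold_load_s)

def score_server(server: dict, model: str, frames: int) -> float:
    """Model-aware scoring sketch: servers without the model score 0 and are
    skipped; warm servers pay no cold-load penalty."""
    if model not in server["models"]:
        return 0.0
    cold = 0.0 if server.get("warm") else server["cold_load_s"]
    return effective_fps(frames, server["fps"], cold)

warm_a100 = {"models": ["face_recognition"], "fps": 120, "cold_load_s": 2.1, "warm": True}
cold_3090 = {"models": ["face_recognition"], "fps": 80, "cold_load_s": 2.1}
no_model = {"models": ["object_detection"], "fps": 300, "cold_load_s": 2.1}

# Server without the model scores 0 regardless of raw speed.
scores = [score_server(s, "face_recognition", 10_000)
          for s in (warm_a100, cold_3090, no_model)]
```

Note how the cold penalty barely dents a 10,000-frame job but would halve the effective rate of a job a few hundred frames long.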
13 - Token Wait

Waiting for a Slot

When all GPUs are busy, your job enters a priority queue. The client library polls automatically, waiting for the scheduler leader to assign a slot and accept the token.

1. All slots busy: submit_job() reaches the scheduler, but both slots are occupied - the response is status: "queued".
2. Waiting for assignment: wait_for_assignment() polls GET /jobs/{id} every 3 seconds while the job sits in the priority queue (e.g. a realtime p:0 job ahead of your p:1 job, with a p:2 batch job behind).
3. A slot frees up: a previous job completes and releases its slot; the leader's _promote_queued_jobs() atomically claims the slot and assigns the job.
4. The next poll returns {status: "assigned"} with servers, token_id, and gateway. The client exits the wait loop → connect(job) → gRPC stream with the token → infer_batch(frames) → complete().

Total wait time depends on cluster load and job priority; realtime (p:0) jobs promote first.
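The "priority first, then FIFO" ordering is exactly what a heap keyed on (priority, arrival order) gives you. A minimal sketch of the promotion order (class and method names are illustrative):

```python
import heapq
import itertools

class JobQueue:
    """Priority queue sketch: lower priority number wins; ties are FIFO."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # arrival counter breaks priority ties

    def push(self, job_id: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), job_id))

    def promote(self):
        """Pop the next job to assign, or None if the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None

q = JobQueue()
q.push("batch-job", 2)    # BATCH    (lowest)
q.push("your-job", 1)     # NORMAL
q.push("live-job", 0)     # REALTIME (highest)
order = [q.promote(), q.promote(), q.promote()]
```

Even though the realtime job arrived last, it promotes first; equal-priority jobs would come out in arrival order.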
14 - Dynamic Plans

Plans Adjust When Jobs Arrive

When a new job arrives and the cluster is busy, the scheduler contracts existing plans - the running client drops a connection so the new job can be served. When capacity frees up, plans expand back.

1. Steady state: Job A uses three servers via direct fan-out (weights 48% / 32% / 20%).
2. A new Job B arrives: with queued > 0, the scheduler contracts the plan and yields the weakest peer (Server 3). Client A detects the plan version change, closes that connection, and rebalances to 60% / 40%; Job B runs on Server 3.
3. Job B completes: with remaining frames ≥ fanout_min, the plan expands back and Server 3 returns to Job A (48% / 32% / 20% again).

How it works under the hood: the scheduler does a $pull of the weakest server from servers[] and an $inc on version; the client detects version > plan_version, closes the stale connection, and rebalances its weights.
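The client side of that contraction is a reconciliation against the new plan. A sketch, with plain dicts standing in for real gRPC connections and the MongoDB plan document (names are illustrative):

```python
def reconcile_plan(connections: dict, plan: dict):
    """When the scheduler bumps the plan version after pulling a server out
    of servers[], the client keeps only connections still in the plan and
    closes the stale ones."""
    live = {s: c for s, c in connections.items() if s in plan["servers"]}
    closed = sorted(set(connections) - set(live))
    return live, closed

# Job A held three connections; the scheduler yielded Server 3 to Job B.
conns = {"srv-1": "conn-1", "srv-2": "conn-2", "srv-3": "conn-3"}
plan = {"version": 2, "servers": ["srv-1", "srv-2"]}
live, closed = reconcile_plan(conns, plan)
```

After reconciling, the client re-runs its weight update over the surviving connections, which is how the 48/32/20 split becomes 60/40.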
15 - Dynamic Scaling

Nodes Join and Leave. Zero Downtime.

Add GPU nodes anytime - they bootstrap, benchmark, and start serving. If a node goes down, the system adapts instantly: connections reroute, plans update, processing continues.

Running: two nodes (Node A with 2 GPUs, Node B with 1) serve an active job. Node joins: Node C bootstraps, calibrates, and the client opens a new connection as its plan expands. Node leaves: the node's Consul TTL expires, the client closes that connection, and traffic reroutes to the remaining nodes. Zero downtime - failed frames are retried on healthy servers and weights rebalance automatically.
16 - Command Center

Monitor Everything

The Command Center dashboard shows real-time cluster health, GPU utilization, VRAM allocation, job status, throughput metrics, network latency, and batch calibration - all in one view.

Dashboard snapshot: 3/3 healthy nodes, 7 GPUs, 109.3 / 456 GB VRAM used, active models face_recognition, object_detection, and video_seg. Per-node cards show GPU utilization, per-model VRAM allocation, frames processed, uptime, and active jobs - from inference-server-1 (a single RTX 3090, 10.0 / 24.0 GB) up to inference-server-2 (4x B6000 96 GB, 4 slots, 85.2 / 384.0 GB). Panels cover throughput (frames/sec), batch calibration curves (FPS vs batch size, optimal = 32 on an RTX 4090), a network latency matrix (8-15 ms between servers), scheduler slots (5 total, 3 active, 0 queued, 2 free), active jobs with progress, and cluster topology (2 clients, 3 nodes, 5 GPUs). Real-time WebSocket updates, Prometheus metrics, GPU profiling, and a scheduler decisions audit trail.
Integrations

External System Integrations

Connect the inference platform to your existing workflows - media asset management, cloud storage, file systems, and more. Coming soon.

Iconik
Iconik
Media Asset Management
Automatically process assets as they're ingested - face recognition, object detection, metadata enrichment. Results written back as Iconik metadata.
CatDV
CatDV
Catalog & Asset Management
Run inference on cataloged media and write results back as metadata, markers, and subclips. Trigger on catalog events.
S3 AWS
Amazon S3
Cloud Object Storage
Watch S3 buckets for new objects. Trigger inference jobs automatically on upload. Write results alongside source files or to a separate output bucket.
File System
Local & Network Storage
Watch directories for new files. Process images and video frames as they land on shared storage, NAS, or local disk. Results written to configurable output paths.
GCS Google
Google Cloud Storage
Cloud Object Storage
Watch GCS buckets for new objects via Pub/Sub notifications. Trigger inference on upload. Write results to the same or different bucket with configurable prefixes.
+
Custom Integration
Build Your Own
Implement the IntegrationHandler interface to connect any storage system. Register a URI scheme and the platform handles the rest - routing, inference, and result delivery.
Integration handler architecture - integrations register URI handlers, and client services use URIs to access assets through any backend.

Step 1 - integrations register handlers on install. The integration registry maps schemes to handlers via register_handler(scheme, handler): iconik:// → IconikHandler, catdv:// → CatDVHandler, s3:// → S3Handler, file:// → FileHandler, your:// → YourHandler.

Step 2 - a client service processes an asset via URI. get_asset("iconik://asset/abc-123") resolves the iconik:// scheme to its handler, which calls fetch_asset(asset_id: "abc-123") against the Iconik API (GET /v1/assets/abc-123/proxies), downloads the proxy file, and returns {frames, metadata}. (With a shared mount, the handler can instead save the file and return a path like /mnt/shared/abc-123.mp4 that the client reads directly.) The client then runs submit_job("face_recognition") → infer_batch(frames), gets results (detections, embeddings, ...), and write_results("iconik://asset/abc-123", results) updates the asset's metadata.

Adding a new integration is just registering a handler:

class MyStorageHandler implements IntegrationHandler:
    scheme = "mystorage"            // handles mystorage://... URIs
    fetch_asset(uri) → frames       // download & decode
    write_results(uri, results)     // write back

Register once → works with all client services, all inference types, and all GPU servers.
Engineering

Roadmap

Planned enhancements to the inference platform - smarter scheduling policies, cost-aware routing, and secure remote inference.

Time-Based Schedule Policies
Scheduler Enhancement
Allocate GPU slots based on time windows. Full utilization overnight for batch processing, reduced capacity during business hours for interactive workloads. Define policies per GPU, per node, or cluster-wide.
overnight: 100% slots daytime: 50% slots weekends: 75% slots
$
Cost-Based Schedule Policies
Scheduler Enhancement
Prefer cheaper processing paths. Local and intranet GPUs are free - the scheduler favors them. Remote cloud GPUs factor in bandwidth and compute costs. Jobs route to the most cost-effective GPU that meets the performance requirement.
local GPU: $0/hr intranet GPU: $0/hr cloud GPU: $2.50/hr + bandwidth
gRPC Streaming over TLS
Security & Remote Inference
Encrypted gRPC streams for off-site and remote GPU inference. Run inference across data centers, cloud regions, or edge locations with mutual TLS authentication. Extends the cluster beyond the local network securely.
mTLS authentication cross-datacenter edge inference
Throughput Priority Scheduling
Scheduler Enhancement
Live workloads - natural language search, live video - get the fastest GPUs with lowest latency. Background tasks automatically yield to higher-powered GPUs when live inference needs them.
live: fastest GPU + lowest latency background: any available GPU
Throughput priority - live vs background:

1. A background job uses all GPUs: Server 1 (2x RTX 6000, 48 GB each, 280 FPS) and Server 2 (RTX 4090, 24 GB, 95 FPS) both run the 50k-frame batch_process job via fan-out (375 FPS total).
2. A live video face_recognition job requests a plan with priority=high, capacity=60%+ of the cluster, latency=realtime. All slots are occupied, so the scheduler scores the GPUs (RTX 6000: 280 FPS at 3 ms ✓ best; RTX 4090: 95 FPS, doesn't meet the 60% requirement) and contracts the background job to free the RTX 6000s.
3. The background job yields: live face_recognition runs at 280 FPS on the 2x RTX 6000 while the batch job continues, contracted, on the RTX 4090 at 95 FPS - uninterrupted.
4. The live job completes and the background job expands back to all GPUs.

The cycle: background uses all GPUs → live job queued → scheduler yields the fastest GPU → live runs at full speed → live completes → background expands back. Zero downtime.
Inception

How It All Started

Demo

See It in Action

Videos and screenshots of the inference platform running.

Installation Demo

Installation

Get Started in One Command

Install the full inference platform - GPU servers, scheduler, service discovery, monitoring - with a single command. The interactive installer handles everything.

Run this in your terminal:
curl -fsSL "https://llamatron.ai/install.sh" | sh
Requires: at least one NVIDIA GPU.
llamatron-installer