Focus on your business logic. The platform handles GPU resource planning, automatic model profiling, dynamic scaling, and intelligent job routing - even with a single GPU.
Your application talks to one simple client. Behind the scenes, the inference network manages resources, routes jobs, profiles hardware, and scales automatically.
The full sequence from job submission to inference streaming. The client library handles all of this - your app just calls submit, connect, infer.
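From the application's side, the whole sequence collapses into three calls. A minimal sketch of that contract - note that `InferenceClient`, `submit`, `connect`, and `infer` are illustrative stand-ins here, not the platform's actual API:

```python
# Illustrative sketch of the submit -> connect -> infer flow.
# InferenceClient and its method names are hypothetical stand-ins,
# not the platform's real client API.

class InferenceClient:
    def __init__(self):
        self.job_id = None
        self.connected = False

    def submit(self, model):
        # In the real flow, this asks the scheduler for a job plan.
        self.job_id = "job-1"
        return self.job_id

    def connect(self):
        # In the real flow, this opens streams to the assigned server(s).
        self.connected = True

    def infer(self, frame):
        # In the real flow, frames are batched and results streamed back.
        assert self.connected, "call connect() before infer()"
        return {"frame": frame, "result": "ok"}

client = InferenceClient()
client.submit(model="detector")   # hypothetical model name
client.connect()
out = client.infer(frame=0)
```

Everything below that surface - planning, routing, scaling - happens without the application's involvement.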
Simple mode sends everything through one gateway - the server fans out internally. Cluster mode connects directly to all servers and distributes frames by measured throughput.
Works with any topology. Single node with one GPU, multi-GPU nodes, or a full cluster. The system adapts to whatever hardware you have.
In simple mode, your client sends frames to one server. The gateway batches them and processes locally. If scoring indicates it's beneficial, it fans out to peers - otherwise it handles everything itself.
In cluster mode, the client connects directly to all servers, distributes frames by measured throughput, and adapts weights dynamically using EMA smoothing.
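The weight adaptation in cluster mode can be sketched as follows. The smoothing factor and field names are assumptions for illustration; the mechanism - blend each new throughput reading into a running exponential moving average, then split frames proportionally - is the one described above:

```python
# Sketch: distribute frames across servers in proportion to measured
# throughput, smoothed with an exponential moving average (EMA).
# ALPHA and the server names are illustrative assumptions.

ALPHA = 0.2  # EMA smoothing factor: higher = react faster to new readings

def update_weights(ema, measured):
    """Blend new throughput measurements into the running EMA."""
    return {
        server: ALPHA * measured[server] + (1 - ALPHA) * ema[server]
        for server in ema
    }

def shares(ema):
    """Fraction of frames each server should receive."""
    total = sum(ema.values())
    return {server: tput / total for server, tput in ema.items()}

ema = {"gpu-a": 100.0, "gpu-b": 100.0}  # frames/sec estimates
ema = update_weights(ema, {"gpu-a": 200.0, "gpu-b": 100.0})
# gpu-a's weight rises gradually rather than jumping to the new reading
```

Smoothing keeps one noisy measurement from shifting the whole frame distribution at once.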
When a server starts, it profiles every model: VRAM usage, throughput at each batch size, multi-slot capability, memory hierarchy. No manual configuration needed.
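The throughput part of that startup profiling can be sketched like this. `profile()` and the dummy workload are illustrative; the real profiler also records VRAM usage, slot capability, and memory hierarchy:

```python
# Sketch: measure a model's throughput at several batch sizes, as a
# server might do on startup. The function and workload are
# illustrative stand-ins, not the platform's actual profiler.
import time

def profile(run_batch, batch_sizes=(1, 2, 4, 8)):
    """Return estimated frames/sec at each batch size."""
    results = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        run_batch(bs)  # one timed inference at this batch size
        elapsed = time.perf_counter() - start
        results[bs] = bs / elapsed
    return results

def dummy_batch(bs):
    # Stand-in for a real model forward pass.
    sum(i * i for i in range(bs * 10_000))

throughput = profile(dummy_batch)
```

Measuring instead of configuring means the scheduler's numbers match the hardware actually installed.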
Models move between Disk, RAM, and VRAM automatically. Multiple plugins can share the same model via reference counting. In step mode, models swap between RAM and VRAM between pipeline stages.
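The sharing mechanism is plain reference counting: a model stays resident while any plugin holds a reference, and becomes evictable only when the last one releases it. A minimal sketch (class and method names are illustrative):

```python
# Sketch of reference-counted model sharing: plugins acquire and
# release a model; it can leave VRAM only when the count hits zero.
# Names are illustrative, not the platform's actual classes.

class ModelCache:
    def __init__(self):
        self.refcounts = {}

    def acquire(self, model):
        # A first acquire would trigger the Disk -> RAM -> VRAM load here.
        self.refcounts[model] = self.refcounts.get(model, 0) + 1
        return self.refcounts[model]

    def release(self, model):
        self.refcounts[model] -= 1
        if self.refcounts[model] == 0:
            # Last user gone: the model may be evicted from VRAM
            # (or demoted to RAM between stages in step mode).
            del self.refcounts[model]

cache = ModelCache()
cache.acquire("detector")  # plugin A loads the model
cache.acquire("detector")  # plugin B shares the same weights
cache.release("detector")  # model stays resident: B still holds a reference
```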
Each server can run local GPU plugins or connect to external inference engines. The abstraction layer makes all backends look the same to the scheduler.
One node becomes the scheduler leader via Consul distributed locking. The leader handles all scheduling decisions: job assignment, plan adjustments, queue promotion, and health monitoring.
When a new node joins the cluster, it bootstraps automatically. The cluster then runs network latency probes and recalculates routing parameters across all nodes.
Jobs are intelligently scored against all available GPUs, assigned with fan-out when beneficial, and prioritized when capacity is limited. The scheduler makes data-driven routing decisions.
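To make "scored" concrete, here is one plausible shape such a score could take - rank GPUs by throughput and free VRAM, penalized by latency. The weights, fields, and formula below are purely illustrative assumptions, not the scheduler's actual policy:

```python
# Illustrative GPU scoring sketch: higher score = better placement.
# The coefficients and field names are assumptions for this example.

def score(gpu):
    return (gpu["throughput_fps"]
            + 0.01 * gpu["free_vram_mb"]
            - 2.0 * gpu["latency_ms"])

gpus = [
    {"name": "node1/gpu0", "throughput_fps": 120, "free_vram_mb": 8000,  "latency_ms": 1},
    {"name": "node2/gpu0", "throughput_fps": 90,  "free_vram_mb": 16000, "latency_ms": 5},
]
best = max(gpus, key=score)
# The slower GPU can still win if it has far more free VRAM.
```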
When all GPUs are busy, your job enters a priority queue. The client library polls automatically until the scheduler leader assigns a slot and accepts its queue token.
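The client-side half of queueing is a retry loop: poll until the scheduler reports the job as assigned. A sketch under assumed names (the `poll` callback, statuses, and retry limit are illustrative; a real client would also sleep with backoff between polls):

```python
# Sketch of client-side queue polling. The statuses and poll()
# callback are illustrative assumptions, not the platform's API.

import itertools

def wait_for_slot(poll, max_attempts=10):
    """Poll until the scheduler reports the job as assigned."""
    for attempt in itertools.count():
        status = poll()
        if status == "assigned":
            return attempt
        if attempt >= max_attempts:
            raise TimeoutError("no capacity freed up")
        # A real client would sleep with backoff here.

# Simulated scheduler: job is queued twice, then assigned.
responses = iter(["queued", "queued", "assigned"])
attempts = wait_for_slot(lambda: next(responses))
```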
When a new job arrives and the cluster is busy, the scheduler contracts existing plans - the running client drops a connection so the new job can be served. When capacity frees up, plans expand back.
Add GPU nodes anytime - they bootstrap, benchmark, and start serving. If a node goes down, the system adapts instantly: connections reroute, plans update, processing continues.
The Command Center dashboard shows real-time cluster health, GPU utilization, VRAM allocation, job status, throughput metrics, network latency, and batch calibration - all in one view.
Connect the inference platform to your existing workflows - media asset management, cloud storage, file systems, and more. Coming soon.
Planned enhancements to the inference platform - smarter scheduling policies, cost-aware routing, and secure remote inference.
Videos and screenshots of the inference platform running.
Install the full inference platform - GPU servers, scheduler, service discovery, monitoring - with a single command. The interactive installer handles everything.
curl -fsSL "https://llamatron.ai/install.sh" | sh