Introduction
With the explosion of distributed frameworks and on-demand inference APIs, you might think it is enough to plug an existing orchestrator into a GPU cluster and run an LLM service at scale.
But very quickly, certain technical and economic obstacles become apparent.
The Limits of API-as-a-Service
Cloud APIs (OpenAI, Mistral, etc.) quickly impose rate limits -- quotas per minute, per hour -- that become constraining as soon as you want to scale properly, especially in a local or sovereign service model.
And even when the limits are acceptable, the dependency on a remote, uncontrolled resource creates problems of latency, resilience, and cost.
Spinning Up GPU Servers: A False Good Idea?
You might think that dynamically provisioning GPU machines gives you freedom. But here too, you hit concrete limitations:
- How long does it really take to spin up an hourly GPU server?
- How many simultaneous requests can they handle?
- How does that compare to more accessible servers (like Apple M4)?
- And most importantly: is it cost-effective?
What I Observed in Practice
I have not yet tested Kubernetes, Ray, or other distributed orchestrators, but I anticipate unnecessary complexity for my use case. What I am looking for is a system that is:
- Simple to deploy,
- Optimized for local, heterogeneous, and dynamic resources,
- Capable of handling resilience without overengineering.
Most solutions I evaluated are oversized, require too much configuration to deploy, are not sovereign enough (cloud dependencies that are hard to audit), and above all are poorly optimized for a local, modular, and frugal model -- which is precisely the DNA of mAIstrow.
A Homegrown Orchestrator, Written in Rust
In this context, I decided to design my own orchestrator. It is a lightweight, loosely coupled, asynchronous, and resilient-by-design system, built to leverage:
- Fixed or volatile servers (with or without GPUs),
- Various inference engines (llama.cpp, vLLM, Ollama, etc.),
- Intelligent fallback strategies (streaming, timeouts, automatic retries).
The goal: stop being constrained by infrastructure limitations, and instead build a software foundation that works here, now, and tomorrow, regardless of the available resources.
Here are the choices I made -- and the tests I plan to run to validate them.
1. A Three-Body System: Hub, Engine, Client
The architecture rests on three components:
- The Hub-server: it centralizes everything. It receives requests, schedules them, dispatches them, and tracks them. It owns the database, the timeouts, and the auto-scaling logic.
- The Engine-services: they execute tasks (LLM inference, embeddings, vector search, etc.). They are silent, frugal workers that can disengage at any time.
- The User-clients: they send requests. That is all. The Hub handles the rest.
A client disappears? No problem, the task continues. An Engine goes down? No problem, the task is relaunched elsewhere. The Hub restarts? In-flight tasks are restored.
This decoupling is the key to resilience. Everything else is implementation detail.
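To make the decoupling concrete, here is a minimal sketch of the messages the three components could exchange. The type names are illustrative assumptions, not the actual mAIstrow code:

```rust
/// What a User-client sends to the Hub: a request, nothing more.
struct ClientRequest {
    prompt: String,
}

/// What the Hub sends to an Engine: a task identifier plus the work to do.
struct Dispatch {
    task_id: u64,
    payload: String,
}

/// What an Engine streams back to the Hub while it works.
enum EngineUpdate {
    Chunk { task_id: u64, text: String },
    Done { task_id: u64 },
    Failed { task_id: u64, reason: String },
}

fn main() {
    // Client -> Hub: the client's only responsibility is to submit.
    let request = ClientRequest { prompt: "Summarize this document".into() };

    // Hub -> Engine: the Hub assigns an id, persists the task, then dispatches it.
    let dispatch = Dispatch { task_id: 1, payload: request.prompt };

    // Engine -> Hub: the Engine streams partial results, then signals completion.
    let updates = [
        EngineUpdate::Chunk { task_id: dispatch.task_id, text: "First tokens...".into() },
        EngineUpdate::Done { task_id: dispatch.task_id },
    ];

    for update in updates {
        match update {
            EngineUpdate::Chunk { task_id, text } => println!("task {task_id}: {text}"),
            EngineUpdate::Done { task_id } => println!("task {task_id}: done"),
            EngineUpdate::Failed { task_id, reason } => println!("task {task_id}: failed ({reason})"),
        }
    }
}
```

The asymmetry is the point: clients only submit, Engines only stream back, and every durable decision lives in the Hub.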
2. The Task as a First-Class Citizen
Everything revolves around one concept: the Task.
It is a generic Rust struct. Whether it is an LLM request, an embedding computation, or a hybrid search -- it remains a Task.
Each task contains a payload, an optional context for resumption, and a status that is persisted on every update. The Hub is the only writer to the database. The Engine simply streams updates.
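As an illustration, a minimal version of such a struct could look like the sketch below. The field and type names are assumptions made for the example, not the real implementation:

```rust
use std::time::SystemTime;

/// Lifecycle of a task; the Hub persists every transition to the database.
#[derive(Debug, Clone)]
enum TaskStatus {
    Queued,
    Running { engine_id: String },
    Done,
    Failed { reason: String, retries: u32 },
}

/// Example payloads: LLM inference, embeddings, hybrid search --
/// from the Hub's point of view, they all stay "a Task".
#[derive(Debug, Clone)]
enum Payload {
    LlmCompletion { prompt: String },
    Embedding { text: String },
    HybridSearch { query: String },
}

/// A generic task: the payload says what to do, the optional context lets
/// an Engine resume work another Engine started, and the status is the
/// field the Hub writes to the database on every update.
#[derive(Debug, Clone)]
struct Task<P> {
    id: u64,
    payload: P,
    context: Option<String>,
    status: TaskStatus,
    updated_at: SystemTime,
}

fn main() {
    let task = Task {
        id: 42,
        payload: Payload::LlmCompletion { prompt: "Hello".into() },
        context: None,
        status: TaskStatus::Queued,
        updated_at: SystemTime::now(),
    };
    println!("{task:?}");
}
```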
3. The Hub: Single Brain, Centralized Memory
The Hub receives all requests. It tracks every task, pings every Engine, measures latency, and retries when needed. It manages timeouts, priorities, and can decide to request more capacity.
No need for Kafka. No need for Ray. Everything is streamed, controlled, and retriable via a simple timer and an SQL table.
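Here is a rough sketch of what that dispatch path can reduce to, assuming Tokio. The `call_engine` function stands in for the real streaming round-trip and is purely hypothetical, as is `dispatch_with_retry`:

```rust
use std::time::{Duration, Instant};
use tokio::time::timeout;

/// A stand-in for the real HTTP/stream round-trip to an Engine.
async fn call_engine(payload: &str) -> Result<String, String> {
    tokio::time::sleep(Duration::from_millis(50)).await;
    Ok(format!("result for: {payload}"))
}

/// Dispatch a task with a per-attempt timeout, measuring latency and
/// retrying a bounded number of times -- the Hub's core loop reduced
/// to its simplest expression.
async fn dispatch_with_retry(payload: &str, max_retries: u32) -> Result<String, String> {
    let per_attempt = Duration::from_secs(30);
    for attempt in 0..=max_retries {
        let started = Instant::now();
        match timeout(per_attempt, call_engine(payload)).await {
            Ok(Ok(result)) => {
                // Latency is recorded so the Hub can prefer faster Engines later.
                println!("attempt {attempt} succeeded in {:?}", started.elapsed());
                return Ok(result);
            }
            Ok(Err(err)) => println!("attempt {attempt} failed: {err}"),
            Err(_) => println!("attempt {attempt} timed out"),
        }
    }
    Err("all retries exhausted".into())
}

#[tokio::main]
async fn main() {
    match dispatch_with_retry("summarize this", 2).await {
        Ok(result) => println!("{result}"),
        Err(err) => eprintln!("{err}"),
    }
}
```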
4. The Engine: Stateless, but Not Without Honor
Engine-services are designed to die gracefully. They hold no durable state. They receive a Task, execute it, stream the results, say "done," and potentially shut down if they are no longer needed.
This design enables native scalability. A single command from the Hub is enough to instantiate a new GPU process on a Scaleway or OVH instance, or even on a local server.
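A stripped-down Engine loop might look like the sketch below, where Tokio channels stand in for the real connection to the Hub and the names are illustrative:

```rust
use std::time::Duration;
use tokio::sync::mpsc;

/// Updates an Engine streams back to the Hub.
#[derive(Debug)]
enum Update {
    Chunk(String),
    Done,
}

/// An Engine worker: pull a task, stream results, report "done",
/// and exit when no work arrives within the idle window.
async fn engine_loop(mut tasks: mpsc::Receiver<String>, updates: mpsc::Sender<Update>) {
    let idle_limit = Duration::from_secs(5);
    loop {
        let task = match tokio::time::timeout(idle_limit, tasks.recv()).await {
            Ok(Some(task)) => task,
            _ => break, // channel closed or idle too long: die gracefully
        };
        // "Execute" the task and stream partial results as they come.
        for token in task.split_whitespace() {
            let _ = updates.send(Update::Chunk(token.to_string())).await;
        }
        let _ = updates.send(Update::Done).await;
    }
    // Nothing to clean up: the Hub owns all durable state.
}

#[tokio::main]
async fn main() {
    let (task_tx, task_rx) = mpsc::channel(8);
    let (update_tx, mut update_rx) = mpsc::channel(8);

    let engine = tokio::spawn(engine_loop(task_rx, update_tx));

    task_tx.send("hello from the hub".to_string()).await.unwrap();
    drop(task_tx); // no more work: the Engine will shut itself down

    while let Some(update) = update_rx.recv().await {
        println!("{update:?}");
    }
    engine.await.unwrap();
}
```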
5. Resilience Without Overhead
This system tolerates the disconnection of the client, the Engine, or even the Hub -- without ever breaking business logic.
The database is updated continuously, each Engine sends its state, and if nothing happens for too long, the Hub automatically retries. Simple, readable, testable.
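The "timer plus SQL table" idea boils down to one UPDATE statement run on a schedule. Below is a minimal sketch using SQLx with an in-memory SQLite database; the table and column names (tasks, status, retries, updated_at) are illustrative, not the actual schema:

```rust
use std::time::Duration;
use sqlx::sqlite::SqlitePool;

/// One sweep of the watchdog: any task still marked 'running' whose last
/// update is older than the timeout goes back to the queue, with its
/// retry counter bumped.
async fn requeue_stale_tasks(pool: &SqlitePool, timeout_secs: i64) -> sqlx::Result<u64> {
    let result = sqlx::query(
        "UPDATE tasks
         SET status = 'queued', retries = retries + 1
         WHERE status = 'running'
           AND updated_at < strftime('%s', 'now') - ?",
    )
    .bind(timeout_secs)
    .execute(pool)
    .await?;
    Ok(result.rows_affected())
}

#[tokio::main]
async fn main() -> sqlx::Result<()> {
    // In-memory database to keep the sketch self-contained.
    let pool = SqlitePool::connect("sqlite::memory:").await?;
    sqlx::query("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT, retries INTEGER, updated_at INTEGER)")
        .execute(&pool)
        .await?;
    // A task that has been 'running' with no news for ten minutes.
    sqlx::query("INSERT INTO tasks VALUES (1, 'running', 0, strftime('%s', 'now') - 600)")
        .execute(&pool)
        .await?;

    // The Hub would run this on a plain Tokio interval; one tick shown here.
    let mut ticker = tokio::time::interval(Duration::from_secs(30));
    ticker.tick().await; // the first tick fires immediately
    let requeued = requeue_stale_tasks(&pool, 300).await?;
    println!("requeued {requeued} stale task(s)");
    Ok(())
}
```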
What Comes Next?
I have started prototyping this system in Rust, with a simple stack (Tokio, SQLx, Actix or Axum). The goal is to make it self-deployable, observable, and predictive, through logs, dynamic thresholds, and metrics.
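If the Axum route is taken, the HTTP front of the Hub can start very small. Here is a sketch assuming Axum 0.7 and serde, where the /tasks route and the request/response types are placeholders rather than the final API:

```rust
use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct SubmitRequest {
    prompt: String,
}

#[derive(Serialize)]
struct SubmitResponse {
    task_id: u64,
}

/// Accept a task from a User-client. The real Hub would persist it and
/// enqueue it; here we only echo a fake identifier.
async fn submit(Json(req): Json<SubmitRequest>) -> Json<SubmitResponse> {
    println!("received prompt: {}", req.prompt);
    Json(SubmitResponse { task_id: 1 })
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/tasks", post(submit));
    let listener = tokio::net::TcpListener::bind("127.0.0.1:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```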
The design is intentionally minimalist. This is not a generic framework: it is a tool forged for a specific need -- orchestrating LLM inference on heterogeneous resources, with resilience as the primary constraint.
The next parts in this series will cover distributed architecture, comparisons with Kafka, Pulsar, and Ray, and the details of asynchronous implementation in Rust.