Scalability Series - Part 2
Introduction
Before diving into the async architecture of mAIstrow, let us take a moment to look at what already exists. Kafka, Pulsar, Redpanda, Ray... these names come up whenever distributed systems are discussed. OpenAI and Google handle millions of users, but they do not tell you how. I want to be transparent, learn from existing solutions, and decide -- with full knowledge -- whether to draw inspiration from them or stay true to my vision: a distributed, sovereign, and ecological AI built on Rust.
Because, let us be honest: what if mAIstrow had to handle tens of thousands of users? It is not a question of "can we" -- it is a question of "how."
Why Explore Distributed Frameworks?
When I worked on Plan 9, I understood something fundamental: to innovate is also to understand what already exists. You do not reinvent the wheel -- you look at it, measure it, compare it, then decide whether to build it better.
My three-body system (server, AI engine, interface) is lightweight, local, and designed for sovereignty. But if mAIstrow were to become a public platform with thousands of users, I would need to understand how the giants do it. Not to copy, but to learn, and to decide whether to adapt or surpass.
That is why I explore Kafka, Pulsar, Redpanda, and Ray here -- not to abandon Rust, but to refine my choices and keep control over what I build.
Apache Kafka: The Streaming Giant
Kafka is a monolith built for raw performance. It would replace my Rust server with a cluster of brokers and my requests/responses with topics:
- The interface sends a request to a `requests` topic.
- The AI engines, as consumer groups, read it, process it, and publish the response to `responses`.
- Kafka handles automatic distribution via partitions and resilience through replication.
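To make that mapping concrete, here is a minimal sketch of the flow using the `rdkafka` crate (the same Rust client I come back to in the Redpanda section below). It assumes a broker reachable at localhost:9092 and rdkafka's tokio-based async API; the group id, key, and payload are placeholders, not mAIstrow's actual wiring.

```rust
// Minimal sketch of the requests/responses flow with the rdkafka crate.
// Assumes a broker at localhost:9092; key, group id, and payload are placeholders.
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{Consumer, StreamConsumer};
use rdkafka::producer::{FutureProducer, FutureRecord};
use rdkafka::Message;
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Interface side: publish a prompt to the `requests` topic.
    let producer: FutureProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .create()?;
    producer
        .send(
            FutureRecord::to("requests").key("session-1").payload("hello"),
            Duration::from_secs(5),
        )
        .await
        .map_err(|(err, _msg)| err)?;

    // Engine side: one member of the `ai-engines` consumer group picks it up.
    let consumer: StreamConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "ai-engines")
        .set("auto.offset.reset", "earliest")
        .create()?;
    consumer.subscribe(&["requests"])?;
    let message = consumer.recv().await?;
    println!("engine received: {:?}", message.payload_view::<str>());
    // The engine would then publish its answer to `responses` the same way.
    Ok(())
}
```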
It is powerful. It is what powers Netflix, LinkedIn, and Uber. But it is also massively complex.
My current system is centralized, lightweight, and local. A single Rust/Go server, a simple round-robin, SQLite for persistence. Kafka requires a cluster of brokers, Zookeeper (or equivalent) for coordination, and a full infrastructure.
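For a sense of the contrast, that round-robin fits in a handful of lines. The sketch below is purely illustrative, with invented engine names rather than mAIstrow's real code:

```rust
// Hypothetical sketch of a simple round-robin dispatcher; engine names are invented.
struct RoundRobin {
    engines: Vec<String>,
    next: usize,
}

impl RoundRobin {
    /// Returns the next engine in a fixed rotation.
    fn pick(&mut self) -> &str {
        let i = self.next;
        self.next = (self.next + 1) % self.engines.len();
        &self.engines[i]
    }
}

fn main() {
    let mut rr = RoundRobin {
        engines: vec!["engine-a".into(), "engine-b".into(), "engine-c".into()],
        next: 0,
    };
    for _ in 0..5 {
        println!("dispatching to {}", rr.pick());
    }
}
```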
Yet Kafka inspires me. Its partitions show how to distribute load without depending on a single server. Its consumer groups embody resilience by nature.
But my system remains more sovereign, simpler, and more flexible thanks to my Rust traits. Kafka is a tool for massive scale. I want a tool for mastery.
The Challenges of Scaling Kafka in Production
Scaling Kafka in production comes with significant challenges:
- Network saturation: as the number of brokers grows, replication and synchronization can saturate the internal cluster network.
- Operational complexity: partition balancing, monitoring, upgrades, configuration management.
- Partitioning limits: too few partitions hinder scaling; too many overload metadata and coordination.
- CAP tradeoffs: like any distributed system, Kafka must balance consistency, availability, and partition tolerance.
Apache Pulsar: A Modern, Modular Alternative
Pulsar distinguishes itself through its layered architecture:
- Brokers (stateless) handle streaming.
- Bookies (Apache BookKeeper) handle storage, separately.
This separation allows scaling storage independently from compute. You can add brokers for throughput or bookies for capacity, without massive rebalancing or downtime.
Pulsar is more modular than Kafka, handles multi-tenancy natively, and supports built-in geo-replication. It is tempting.
But once again: it is too heavy for my use cases. My system is built to run on a Raspberry Pi, an old laptop, or a VM at a friend's house. Pulsar, even in its "light" version, requires a cluster.
Pulsar would be a good candidate if I wanted to isolate data flows for schools, labs, or educational projects. But for mAIstrow, I prefer to keep total control.
Redpanda: The Lightweight Kafka-Compatible Option
Redpanda is a dream for those who value lightness:
- Compatible with the Kafka API (via `rdkafka` in Rust),
- Written in C++, highly performant,
- Can run locally, even on a Raspberry Pi,
- Single binary, no Java or Zookeeper dependency.
Imagine: my Rust server becomes a Redpanda broker. The `requests` and `responses` topics exist. The interface and engines use `rdkafka` to communicate. It is almost like my current system, but with automatic distribution, built-in resilience, and obvious scalability.
Redpanda is the first framework I want to test. Not to replace my system, but to experiment with it. A proof of concept with a local broker, a Rust client, and a `Transport` trait that can switch between WebSocket and Kafka/Redpanda.
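Here is a rough idea of what that `Transport` trait could look like. It is a hypothetical sketch, not the real abstraction: it assumes Rust 1.75+ (for async fn in traits) and uses an in-memory loopback where a real build would plug in a WebSocket client or an `rdkafka` producer/consumer pair.

```rust
// Hypothetical sketch of a Transport trait (Rust 1.75+ for async fn in traits).
// The in-memory loopback stands in for real WebSocket or rdkafka implementations.
use tokio::sync::mpsc;

/// How the interface sends a request and waits for a response,
/// independently of the wire protocol underneath.
trait Transport {
    async fn send(&mut self, payload: &[u8]) -> Result<(), String>;
    async fn recv(&mut self) -> Option<Vec<u8>>;
}

/// Loopback transport used only to exercise the trait in this example.
/// A real build would add `WebSocketTransport` and `KafkaTransport` impls.
struct ChannelTransport {
    tx: mpsc::Sender<Vec<u8>>,
    rx: mpsc::Receiver<Vec<u8>>,
}

impl Transport for ChannelTransport {
    async fn send(&mut self, payload: &[u8]) -> Result<(), String> {
        self.tx.send(payload.to_vec()).await.map_err(|e| e.to_string())
    }

    async fn recv(&mut self) -> Option<Vec<u8>> {
        self.rx.recv().await
    }
}

/// Application code only ever sees the trait, so swapping transports
/// does not touch this function.
async fn round_trip<T: Transport>(transport: &mut T) {
    transport.send(b"prompt").await.expect("send failed");
    if let Some(reply) = transport.recv().await {
        println!("received {} bytes", reply.len());
    }
}

#[tokio::main]
async fn main() {
    // Wire the sender back to the receiver so the example is self-contained.
    let (tx, rx) = mpsc::channel(16);
    let mut transport = ChannelTransport { tx, rx };
    round_trip(&mut transport).await;
}
```

The point of the abstraction is that `round_trip` never changes when the transport underneath it does.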
Ray: The Distributed Computing Champion
Ray is different. It does not manage streaming -- it manages distributed computing.
In Ray:
- A central actor (the "head") orchestrates tasks.
- AI engines become Ray actors, capable of receiving messages, computing, and returning results.
- The interface sends requests to the actor, which distributes them.
Ray is designed for machine learning, training, and parallel inference. Its architecture rests on dedicated primitives: Tasks (distributable stateless functions) and Actors (distributable stateful objects), each capable of specifying CPU/GPU requirements.
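This is not Ray's API, but the actor pattern it rests on can be sketched in Rust with tokio tasks and channels. Every name below is invented for illustration, standing in for the head/actor roles described above.

```rust
// Not Ray's API: a minimal sketch of the actor pattern Ray is built on,
// written with tokio tasks and channels. All names here are hypothetical.
use tokio::sync::{mpsc, oneshot};

/// Message an "engine actor" can receive: a prompt plus a channel
/// to send the result back on, mirroring Ray's remote-call-and-future model.
struct Request {
    prompt: String,
    reply_to: oneshot::Sender<String>,
}

/// Spawns a stateful actor: it owns its counter and processes
/// messages one at a time, like a Ray actor owning its state.
fn spawn_engine(name: &'static str) -> mpsc::Sender<Request> {
    let (tx, mut rx) = mpsc::channel::<Request>(32);
    tokio::spawn(async move {
        let mut handled = 0u64;
        while let Some(req) = rx.recv().await {
            handled += 1;
            let answer = format!("[{name}] reply #{handled} to '{}'", req.prompt);
            let _ = req.reply_to.send(answer);
        }
    });
    tx
}

#[tokio::main]
async fn main() {
    // The "head" here is just main(): it dispatches to one actor and awaits the result.
    let engine = spawn_engine("engine-a");
    let (reply_tx, reply_rx) = oneshot::channel();
    if engine
        .send(Request { prompt: "hello".into(), reply_to: reply_tx })
        .await
        .is_err()
    {
        eprintln!("engine actor is gone");
        return;
    }
    match reply_rx.await {
        Ok(answer) => println!("{answer}"),
        Err(_) => eprintln!("engine dropped the request"),
    }
}
```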
But Ray is often Python-based and cloud-oriented. My system is Rust, local, and modular. It is an option if I need to handle thousands of users with heavy computation. But for now, I prefer to stick with my traits, my abstractions, and my control.
What OpenAI, Google, and Anthropic Do
The AI giants handle millions of users. They likely use proprietary distributed systems, mixing:
- Message brokers (Kafka-like),
- Compute frameworks (Ray-like),
- Distributed databases (Bigtable, Spanner),
- Cloud infrastructure (AWS, GCP, Azure).
But they do not talk about it. Not because they are secretive -- but because it is their competitive advantage. Their silence is a lesson.
The model is clearly cloud-centric: each depends on a proprietary cloud optimized for their AI workload, which closes the door on open resource access and open innovation.
I want to be different. I want mAIstrow to be transparent. I want people to know how it works, why it is designed this way, and what I learned building it.
Why Stay with Rust?
So, why not switch to Kafka, Pulsar, or Ray?
Because Rust is my tool for mastery, not just performance.
- Performance: Rust is fast, safe, and garbage-collector-free. Ideal for systems where every millisecond counts.
- Total control: I decide the protocol, the persistence, the distribution. No black box.
- Flexibility: my traits allow me to switch transport (WebSocket to Kafka to Redpanda) without rewriting the entire system.
- Frugality: I do not want 100 dependencies. I want a system that runs on an old laptop.
Kafka, Pulsar, Redpanda, Ray -- they are scalability beasts. But mAIstrow is a sovereignty beast.
Next Steps
- Test a POC with Redpanda: use `rdkafka` in Rust, create topics, and integrate a `Transport` trait that switches between WebSocket and Kafka/Redpanda.
- Explore `ray-rs`: if I need to handle heavy parallel computation, try a prototype with Ray actors in Rust.
- Stay modular: the system remains lightweight, local, and sovereign, but it can draw inspiration from these tools without becoming heavy.
Quick Glossary
- Kafka: Distributed message broker, widely used for streaming.
- Pulsar: Modern alternative to Kafka, with storage/compute separation.
- Redpanda: Kafka-compatible, lightweight, written in C++, ideal for POCs.
- Ray: Distributed computing framework, designed for ML, often Python-based.
- Rust traits: Polymorphism mechanism, ideal for abstracting protocols.
- POC: Proof of Concept -- a quick test to validate an idea.