Architecture
Clusterra is a two-sided system. The central side is our shared control plane — one K3s cluster per environment that hosts the Slurm control planes, the cluster-api service, and the scaling logic. The customer side is your K3s cluster in your AWS account where slurmd workers actually run. The two sides talk over VPC peering.
The central side
The central cluster is a single K3s node group we run in us-east-1. It is multi-tenant: each customer gets a dedicated Kubernetes namespace containing their Slurm control plane (a Slinky SlurmCluster custom resource). Those Slinky namespaces include slurmctld, slurmdbd, slurmrestd, the login pod used by the browser terminal, and a per-tenant MariaDB CR.
Sitting in front of everything is cluster-api, a Go service. It authenticates browser sessions against Google OIDC, issues short-lived JWTs, proxies requests to the right tenant's slurmrestd, embeds an LLM agent, and runs the scaling decision loop. It is the only thing browsers and the edge-agent talk to.
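As a rough illustration of the session flow, the sketch below mints a short-lived HS256 JWT the way cluster-api might after a Google OIDC login succeeds. The claim names (`sub`, `tid`), the 15-minute TTL, and the function name are illustrative assumptions, not Clusterra's actual implementation.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"time"
)

// issueSessionJWT sketches minting a short-lived session token after
// OIDC verification. Claim names and the 15-minute TTL are assumptions.
func issueSessionJWT(signingKey []byte, tenantID, email string) string {
	enc := func(v any) string {
		b, _ := json.Marshal(v)
		return base64.RawURLEncoding.EncodeToString(b)
	}
	header := enc(map[string]string{"alg": "HS256", "typ": "JWT"})
	claims := enc(map[string]any{
		"sub": email,
		"tid": tenantID,
		"exp": time.Now().Add(15 * time.Minute).Unix(),
	})
	unsigned := header + "." + claims
	mac := hmac.New(sha256.New, signingKey)
	mac.Write([]byte(unsigned))
	sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return unsigned + "." + sig
}

func main() {
	// Dev-only key for illustration; the real key is synced from Secrets Manager.
	fmt.Println(issueSessionJWT([]byte("dev-only-key"), "acme", "user@example.com"))
}
```

The tenant ID in the token is what lets the proxy route a request to the right tenant's slurmrestd without trusting anything the browser supplies.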
The customer side
In your AWS account we stand up a minimal K3s cluster on a single fixed EC2 control-plane node. That node runs:
- Karpenter for autoscaling compute nodes. Consolidation policy is WhenEmpty, so it never evicts a running slurmd pod.
- Cilium in ENI IPAM mode, which gives pods VPC-routable IPs. That is what lets a slurmd pod in your VPC register with slurmctld in ours.
- The edge-agent, which sends a heartbeat to the central API every few seconds and carries out the scaling commands it gets back.
- A set of slurmd Deployments, one per combination of shape (compute / general / memory) and size (xs / small / medium / large / xlarge). Each Deployment has replicas: 0 until the scaler turns it on.
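The edge-agent's job can be sketched as a single heartbeat cycle: report in, receive scaling commands, apply them. In this sketch, `fetch` stands in for the HTTP call to the central API and `patch` for the Kubernetes PATCH on a slurmd Deployment; the type and field names are illustrative assumptions, not Clusterra's actual wire format.

```go
package main

import "fmt"

// ScaleCommand mirrors what the edge-agent might receive back from the
// central API on each heartbeat. Field names are assumptions.
type ScaleCommand struct {
	Deployment string // e.g. "slurmd-compute-medium"
	Replicas   int
}

// heartbeatOnce performs a single heartbeat cycle: fetch stands in for
// the POST to the central API, patch for the Kubernetes PATCH that sets
// a Deployment's replica count. The real agent would loop this every
// few seconds.
func heartbeatOnce(fetch func() []ScaleCommand, patch func(ScaleCommand) error) {
	for _, cmd := range fetch() {
		if err := patch(cmd); err != nil {
			fmt.Printf("patch %s failed: %v\n", cmd.Deployment, err)
		}
	}
}

func main() {
	fetch := func() []ScaleCommand {
		return []ScaleCommand{{Deployment: "slurmd-compute-medium", Replicas: 3}}
	}
	patch := func(c ScaleCommand) error {
		fmt.Printf("scale %s -> %d\n", c.Deployment, c.Replicas)
		return nil
	}
	heartbeatOnce(fetch, patch) // prints "scale slurmd-compute-medium -> 3"
}
```

Pulling commands over a heartbeat rather than pushing into the customer VPC keeps the connection direction outbound-only from your account.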
How a job flows
- You submit a job in the console. The browser calls POST /v1/clusters/{id}/jobs/submit on api-use1.clusterra.cloud.
- cluster-api injects your Linux UID into the request and proxies it to the tenant's slurmrestd. slurmctld accepts the job as PENDING.
- The scaling loop reads pending jobs, maps each to a slurmd shape by its RAM/vCPU ratio, and writes a desired replica count to DynamoDB.
- The edge-agent heartbeats, receives the scaling commands, and patches the slurmd Deployments. Karpenter sees Pending pods, provisions EC2 nodes from the right instance family, and Cilium attaches them to the VPC.
- slurmd starts, registers with central slurmctld over VPC peering, and slurmctld dispatches the job. Output streams to /mnt/efs/job_{id}.out.
- When the queue empties, Karpenter consolidates the nodes away within a few minutes.
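The shape-mapping step above can be sketched as a small function: pick a slurmd shape from a job's RAM-to-vCPU ratio. The cutoffs here (2 and 6 GiB per vCPU) are illustrative assumptions chosen to separate the three shapes, not Clusterra's real thresholds.

```go
package main

import "fmt"

// shapeFor sketches how the scaler might map a pending job to a slurmd
// shape by its RAM/vCPU ratio. The 2 and 6 GiB/vCPU cutoffs are
// assumptions for illustration.
func shapeFor(memGiB float64, vcpus int) string {
	ratio := memGiB / float64(vcpus)
	switch {
	case ratio < 2:
		return "compute"
	case ratio < 6:
		return "general"
	default:
		return "memory"
	}
}

func main() {
	fmt.Println(shapeFor(4, 4))  // 1 GiB/vCPU  -> compute
	fmt.Println(shapeFor(16, 4)) // 4 GiB/vCPU  -> general
	fmt.Println(shapeFor(64, 4)) // 16 GiB/vCPU -> memory
}
```

The scaler would then sum desired replicas per (shape, size) pair and write those counts to DynamoDB for the edge-agent to pick up on its next heartbeat.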
Why two sides
Running slurmctld centrally means we can patch Slurm, rotate secrets, and upgrade the scheduler without coordinating a maintenance window with every customer. Running the workers in your account means your EC2 bill is your EC2 bill, your IAM boundary is your IAM boundary, and your data never leaves your VPC.
The contract between the two sides is deliberately narrow: a single slurmctld TCP endpoint on port 6817, a heartbeat HTTP endpoint for the edge-agent, and an ArgoCD control plane so fleet-wide config changes (Helm values, operator upgrades) can be rolled out without touching your AWS console.
State of the world
| Store | Where | Contains |
|---|---|---|
| DynamoDB | Central AWS account | Tenants, users, clusters, events, chat history, memories |
| MariaDB (per-tenant) | Central K3s | slurmdbd accounting — job history, associations, QOS |
| K8s Secrets | Central K3s | Slurm auth keys, session signing key (synced from Secrets Manager) |
| EFS | Customer AWS account | Job scripts, stdout/stderr, scratch |
| S3 files bucket | Customer AWS account | User-uploaded inputs, outputs, shared datasets |