Architecture
Clusterra is a two-sided system. The central side is our shared control plane — one K3s cluster per environment that hosts the Slurm control planes, the cluster-api service, and the scaling logic. The customer side is your K3s cluster in your AWS account where slurmd workers actually run. The two sides talk over VPC peering.
The central side
The central cluster is a single K3s node group we run in us-east-1. It is multi-tenant: each customer gets a dedicated Kubernetes namespace containing their Slurm control plane (a Slinky SlurmCluster custom resource). Those Slinky namespaces include slurmctld, slurmdbd, slurmrestd, the login pod used by the browser terminal, and a per-tenant MariaDB CR.
Sitting in front of everything is cluster-api, a Go service. It authenticates browser sessions against Google OIDC, issues short-lived JWTs, proxies requests to the right tenant's slurmrestd, embeds an LLM agent, and runs the scaling decision loop. It is the only thing browsers and the edge-agent talk to.
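As a rough illustration of the session flow, the sketch below mints a short-lived HS256 JWT the way cluster-api might after a Google OIDC login succeeds. The claim names (`sub`, `tid`), the 15-minute TTL, and the function name are illustrative assumptions, not Clusterra's actual implementation.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"time"
)

// issueSessionJWT sketches minting a short-lived session token after
// OIDC verification. Claim names and the 15-minute TTL are assumptions.
func issueSessionJWT(signingKey []byte, tenantID, email string) string {
	enc := func(v any) string {
		b, _ := json.Marshal(v)
		return base64.RawURLEncoding.EncodeToString(b)
	}
	header := enc(map[string]string{"alg": "HS256", "typ": "JWT"})
	claims := enc(map[string]any{
		"sub": email,
		"tid": tenantID,
		"exp": time.Now().Add(15 * time.Minute).Unix(),
	})
	unsigned := header + "." + claims
	mac := hmac.New(sha256.New, signingKey)
	mac.Write([]byte(unsigned))
	sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return unsigned + "." + sig
}

func main() {
	// Dev-only key for illustration; the real key is synced from Secrets Manager.
	fmt.Println(issueSessionJWT([]byte("dev-only-key"), "acme", "user@example.com"))
}
```

The tenant ID in the token is what lets the proxy route a request to the right tenant's slurmrestd without trusting anything the browser supplies.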
The customer side
In your AWS account we stand up a minimal K3s cluster on a single fixed EC2 control-plane node. That node runs:
- Karpenter for autoscaling compute nodes. Consolidation policy is WhenEmpty, so it never evicts a running slurmd pod.
- Cilium in ENI IPAM mode, which gives pods VPC-routable IPs. That is what lets a slurmd pod in your VPC register with slurmctld in ours.
- The edge-agent, which sends a heartbeat to the central API every few seconds and carries out the scaling commands it gets back.
- A set of slurmd Deployments, one per combination of shape (compute / general / memory) and size (xs / small / medium / large / xlarge). Each Deployment has replicas: 0 until the scaler turns it on.
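The edge-agent's job can be sketched as a single heartbeat cycle: report in, receive scaling commands, apply them. In this sketch, `fetch` stands in for the HTTP call to the central API and `patch` for the Kubernetes PATCH on a slurmd Deployment; the type and field names are illustrative assumptions, not Clusterra's actual wire format.

```go
package main

import "fmt"

// ScaleCommand mirrors what the edge-agent might receive back from the
// central API on each heartbeat. Field names are assumptions.
type ScaleCommand struct {
	Deployment string // e.g. "slurmd-compute-medium"
	Replicas   int
}

// heartbeatOnce performs a single heartbeat cycle: fetch stands in for
// the POST to the central API, patch for the Kubernetes PATCH that sets
// a Deployment's replica count. The real agent would loop this every
// few seconds.
func heartbeatOnce(fetch func() []ScaleCommand, patch func(ScaleCommand) error) {
	for _, cmd := range fetch() {
		if err := patch(cmd); err != nil {
			fmt.Printf("patch %s failed: %v\n", cmd.Deployment, err)
		}
	}
}

func main() {
	fetch := func() []ScaleCommand {
		return []ScaleCommand{{Deployment: "slurmd-compute-medium", Replicas: 3}}
	}
	patch := func(c ScaleCommand) error {
		fmt.Printf("scale %s -> %d\n", c.Deployment, c.Replicas)
		return nil
	}
	heartbeatOnce(fetch, patch) // prints "scale slurmd-compute-medium -> 3"
}
```

Pulling commands over a heartbeat rather than pushing into the customer VPC keeps the connection direction outbound-only from your account.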
How a job flows
- You submit a job in the console. The browser calls POST /v1/clusters/{id}/jobs/submit on api-use1.clusterra.cloud.
- cluster-api injects your Linux UID into the request and proxies it to the tenant's slurmrestd. slurmctld accepts the job as PENDING.
- The scaling loop reads pending jobs, maps each to a slurmd shape by its RAM/vCPU ratio, and writes a desired replica count to DynamoDB.
- The edge-agent heartbeats, receives the scaling commands, and patches the slurmd Deployments. Karpenter sees Pending pods, provisions EC2 nodes from the right instance family, and Cilium attaches them to the VPC.
- slurmd starts, registers with central slurmctld over VPC peering, and slurmctld dispatches the job. Output streams to /mnt/efs/job_{id}.out.
- When the queue empties, Karpenter consolidates the nodes away within a few minutes.
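The shape-mapping step above can be sketched as a small function: pick a slurmd shape from a job's RAM-to-vCPU ratio. The cutoffs here (2 and 6 GiB per vCPU) are illustrative assumptions chosen to separate the three shapes, not Clusterra's real thresholds.

```go
package main

import "fmt"

// shapeFor sketches how the scaler might map a pending job to a slurmd
// shape by its RAM/vCPU ratio. The 2 and 6 GiB/vCPU cutoffs are
// assumptions for illustration.
func shapeFor(memGiB float64, vcpus int) string {
	ratio := memGiB / float64(vcpus)
	switch {
	case ratio < 2:
		return "compute"
	case ratio < 6:
		return "general"
	default:
		return "memory"
	}
}

func main() {
	fmt.Println(shapeFor(4, 4))  // 1 GiB/vCPU  -> compute
	fmt.Println(shapeFor(16, 4)) // 4 GiB/vCPU  -> general
	fmt.Println(shapeFor(64, 4)) // 16 GiB/vCPU -> memory
}
```

The scaler would then sum desired replicas per (shape, size) pair and write those counts to DynamoDB for the edge-agent to pick up on its next heartbeat.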
Why two sides
Running slurmctld centrally means we can patch Slurm, rotate secrets, and upgrade the scheduler without coordinating a maintenance window with every customer. Running the workers in your account means your EC2 bill is your EC2 bill, your IAM boundary is your IAM boundary, and your data never leaves your VPC.
The contract between the two sides is deliberately narrow: a single slurmctld TCP endpoint on port 6817, a heartbeat HTTP endpoint for the edge-agent, and an ArgoCD control plane so fleet-wide config changes (Helm values, operator upgrades) can be rolled out without touching your AWS console.
State of the world
| Store | Where | Contains |
|---|---|---|
| DynamoDB | Central AWS account | Tenants, users, clusters, events, chat history, memories |
| MariaDB (per-tenant) | Central K3s | slurmdbd accounting — job history, associations, QOS |
| K8s Secrets | Central K3s | Slurm auth keys, session signing key (synced from Secrets Manager) |
| EFS | Customer AWS account | Job scripts, stdout/stderr, scratch |
| S3 files bucket | Customer AWS account | User-uploaded inputs, outputs, shared datasets |