Architecture

Clusterra is a two-sided system. The central side is our shared control plane: one K3s cluster per environment that hosts the tenant Slurm control planes, the cluster-api service, and the scaling logic. The customer side is your K3s cluster in your AWS account, where the slurmd workers actually run. The two sides talk over VPC peering.

CENTRAL (Clusterra account, us-east-1)

- cluster-api (Go): auth, agent, proxy, scaling decisions
- Slinky operator: one namespace per tenant with slurmctld, slurmdbd, slurmrestd, login, MariaDB
- Shared scaling loop: 20s poll + heartbeat responses
- DynamoDB: tenants, users, clusters, events, chats

CUSTOMER (your account)

- K3s control plane: fixed EC2, runs Karpenter + edge-agent
- slurmd Deployments: one per shape × size; pods land on Karpenter nodes
- EFS: shared /mnt/efs for scripts + stdout/stderr
- S3 files bucket: user uploads, per-user + shared prefixes

VPC peering: slurmd → slurmctld:6817; edge-agent → heartbeat

The central side

The central cluster is a single K3s node group we run in us-east-1. It is multi-tenant: each customer gets a dedicated Kubernetes namespace containing their Slurm control plane (a Slinky SlurmCluster custom resource). Those Slinky namespaces include slurmctld, slurmdbd, slurmrestd, the login pod used by the browser terminal, and a per-tenant MariaDB CR.

Sitting in front of everything is cluster-api, a Go service. It authenticates browser sessions against Google OIDC, issues short-lived JWTs, proxies requests to the right tenant's slurmrestd, embeds an LLM agent, and runs the scaling decision loop. It is the only thing browsers and the edge-agent talk to.

The customer side

In your AWS account we stand up a minimal K3s cluster on a single fixed EC2 control-plane node. That node runs:

- Karpenter, which provisions EC2 worker nodes when slurmd pods go Pending
- the edge-agent, which heartbeats to cluster-api and applies scaling commands to the slurmd Deployments
- Cilium, which attaches new nodes to the VPC network

How a job flows

  1. You submit a job in the console. The browser calls POST /v1/clusters/{id}/jobs/submit on api-use1.clusterra.cloud.
  2. cluster-api injects your Linux UID into the request and proxies it to the tenant's slurmrestd. slurmctld accepts the job as PENDING.
  3. The scaling loop reads pending jobs, maps each to a slurmd shape by its RAM/vCPU ratio, and writes a desired replica count to DynamoDB.
  4. The edge-agent heartbeats, receives the scaling commands, and patches the slurmd Deployments. Karpenter sees Pending pods, provisions EC2 nodes from the right instance family, and Cilium attaches them to the VPC.
  5. slurmd starts, registers with the central slurmctld over VPC peering, and slurmctld dispatches the job. Output streams to /mnt/efs/job_{id}.out.
  6. When the queue empties, Karpenter consolidates the nodes away within a few minutes.

Why two sides

Running slurmctld centrally means we can patch Slurm, rotate secrets, and upgrade the scheduler without coordinating a maintenance window with every customer. Running the workers in your account means your EC2 bill is your EC2 bill, your IAM boundary is your IAM boundary, and your data never leaves your VPC.

The contract between the two sides is deliberately narrow: a single slurmctld TCP endpoint on port 6817, a heartbeat HTTP endpoint for the edge-agent, and an ArgoCD control plane so fleet-wide config changes (Helm values, operator upgrades) can be rolled out without touching your AWS console.

State of the world

| Store | Where | Contains |
| --- | --- | --- |
| DynamoDB | Central AWS account | Tenants, users, clusters, events, chat history, memories |
| MariaDB (per-tenant) | Central K3s | slurmdbd accounting: job history, associations, QOS |
| K8s Secrets | Central K3s | Slurm auth keys, session signing key (synced from Secrets Manager) |
| EFS | Customer AWS account | Job scripts, stdout/stderr, scratch |
| S3 files bucket | Customer AWS account | User-uploaded inputs, outputs, shared datasets |