Getting started

From zero to a running job in about fifteen minutes. You will connect your AWS account via a cross-account IAM role, let Clusterra provision the compute-side cluster, then submit your first Slurm job from the console.

1. Create your workspace

Visit console.clusterra.cloud and sign in with Google. Your first sign-in creates a workspace using your email domain as the tenant slug. If your domain already has a workspace, you will be invited to join it.

2. Deploy the cross-account IAM role

Clusterra runs your slurmd workers, the edge-agent, and Karpenter inside your AWS account. That requires a cross-account IAM role the central control plane can assume via STS with an external ID.

From the console, choose Connect cluster → Deploy role. The console renders a CloudFormation quick-create link pre-filled with the trust policy for Clusterra's control-plane account and your workspace's external ID.

Prefer Terraform? The same role can be created with aws_iam_role + aws_iam_policy_attachment. The console shows the full policy JSON so you can vendor it into your own IaC.
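
If you do vendor it into Terraform, the trust policy is the part that matters. A minimal sketch — the role name, account ID, and variable wiring below are placeholders; take the real policy JSON from the console before applying:

```hcl
# Sketch only, not the canonical module. Copy the actual policy JSON
# from the console's Connect cluster screen.

variable "external_id" {
  type = string
}

resource "aws_iam_role" "clusterra" {
  name = "clusterra-cross-account" # placeholder name

  # Trust policy: only Clusterra's control plane, and only when it
  # presents the workspace's external ID, may assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "arn:aws:iam::111111111111:root" } # control-plane account (placeholder)
      Condition = { StringEquals = { "sts:ExternalId" = var.external_id } }
    }]
  })
}
```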

3. Connect the cluster

Once the role exists, paste its ARN into the console. Clusterra makes a pre-flight sts:AssumeRole call to verify the trust policy and external ID are correct — this catches typos before the twelve-minute provision begins.
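
You can reproduce that check yourself with boto3 if you want to debug the trust policy independently. A sketch — `preflight_params` is a hypothetical helper, and the live call is left commented out because it needs AWS credentials:

```python
# Sketch of the pre-flight trust check. Only the parameter shape is
# shown; the commented-out call is what actually exercises the role.

def preflight_params(role_arn: str, external_id: str) -> dict:
    """Parameters for a one-off sts:AssumeRole trust-policy check."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": "clusterra-preflight",
        "ExternalId": external_id,
        "DurationSeconds": 900,  # shortest session STS allows
    }

# import boto3
# sts = boto3.client("sts")
# sts.assume_role(**preflight_params(role_arn, external_id))
# A bad trust policy or wrong external ID raises an AccessDenied error here.
```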

Behind the scenes the central API launches a Kubernetes Job that runs customer-provision.sh against your account. It:

  1. Creates a VPC, subnets, and a K3s control-plane EC2 instance.
  2. Mounts an EFS filesystem that all slurmd pods will share.
  3. Publishes the K3s join token to SSM Parameter Store.
  4. Installs Karpenter, Cilium, the edge-agent, and an initial set of slurmd Deployments (one per instance-shape/size combo).
  5. Brings up VPC peering back to the central account.
  6. Registers the cluster with the central ArgoCD instance so fleet-level changes can be rolled out per tenant.

When the cluster status flips to running in the console, the central slurmctld sees zero workers — that is normal. Workers scale up on demand when jobs land.

4. Submit your first job

From the Jobs tab, click New job. A minimal raw submission looks like this:

{
  "script": "#!/bin/bash\n#SBATCH --cpus-per-task=2\n#SBATCH --mem=4G\nsrun hostname",
  "job": {
    "name": "hello-world",
    "nodes": "1",
    "current_working_directory": "/mnt/efs",
    "environment": ["PATH=/usr/bin:/bin"]
  }
}
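
Unescaped, the embedded script field is an ordinary sbatch file:

```shell
#!/bin/bash
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
srun hostname
```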

The pending job triggers the scaling loop: the central API looks at the resource request, picks a slurmd shape (compute / general / memory family, xs through xlarge), and tells the edge-agent to scale that Deployment up. Karpenter notices the Pending pod, provisions an EC2 node, Cilium hooks it into the VPC, and slurmd registers with slurmctld. Jobs typically start within 60–120 seconds on a cold cluster.
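
The shape-picking step can be sketched as follows. Hedged heavily: the family heuristic and CPU breakpoints here are invented for illustration — the real mapping lives in the central API:

```python
# Illustrative only. Families and size breakpoints are made up;
# the doc guarantees only "compute / general / memory, xs through xlarge".

def pick_shape(cpus: int, mem_gb: int) -> str:
    # Family: memory-optimized at a high GB:CPU ratio, compute-optimized
    # at a low one, general-purpose otherwise (thresholds invented).
    ratio = mem_gb / max(cpus, 1)
    family = "memory" if ratio > 4 else "compute" if ratio < 2 else "general"
    # Size: smallest bucket that fits the CPU request (breakpoints invented).
    for size, max_cpus in [("xs", 2), ("s", 4), ("m", 8), ("l", 16), ("xlarge", 32)]:
        if cpus <= max_cpus:
            return f"{family}-{size}"
    return f"{family}-xlarge"
```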

5. Watch it run

Once the job is RUNNING, click into it for live stdout/stderr. Clusterra reads the log file directly from EFS and streams it over a Server-Sent Events connection, so what you see in the console is always current.
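
Only the wire framing of that stream is standard SSE (blank-line-delimited events carrying `data:` lines); a minimal parser that assumes nothing about Clusterra's event names:

```python
# Minimal SSE parser: splits a buffered stream into events and yields
# each event's data payload, joining multi-line data fields per the spec.

def parse_sse(raw: str):
    for event in raw.split("\n\n"):
        data = [line[5:].lstrip() for line in event.split("\n")
                if line.startswith("data:")]
        if data:
            yield "\n".join(data)
```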

When the job finishes, the console shows the exit code, walltime, and a cost estimate based on the EC2 rate for the nodes it ran on.
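
The estimate itself is simple arithmetic: nodes × walltime × the on-demand EC2 rate. A back-of-the-envelope sketch, where the rate is a placeholder rather than a quoted price:

```python
# Rough cost model: node-hours times an hourly on-demand rate.
# The rate is whatever EC2 charges for the instance type used.

def estimate_cost(nodes: int, walltime_s: float, hourly_rate_usd: float) -> float:
    node_hours = nodes * (walltime_s / 3600)
    return round(node_hours * hourly_rate_usd, 4)
```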

What to read next