User guide
A tour of the things you actually do day-to-day: submit jobs, stream logs, manage files, understand scaling, and stay under quota.
Submitting jobs
You can submit jobs three ways, all of which land in the same slurmctld.
- Console → new job. Fill in a form. Good for one-offs and for picking a template (GROMACS, AMBER, Nextflow).
- `srun`/`sbatch` from the login shell. Open the in-browser terminal (a WebSocket into the tenant's login pod) and use Slurm the way you would on any cluster.
- HTTP API. `POST /v1/clusters/{id}/jobs/submit`. Good for pipelines, CI, or any automation. See Calling the API.
The submit body
Raw submissions use the slurmrestd v0.0.44 shape. Three things that trip people up:

- `script` goes at the top level of the JSON body, not inside `job`.
- `nodes` must be a string (`"1"`), not an integer.
- `environment` is required — `["PATH=/usr/bin:/bin"]` is fine as a minimum.
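Putting those together, a minimal raw submission might look like this. It is a sketch: the base URL, bearer token, cluster id, and job name are placeholders, and placing everything except `script` under `job` follows the usual slurmrestd convention.

```python
import requests

BASE = "https://clusterra.example.com"      # hypothetical API host
HEADERS = {"Authorization": "Bearer <token>"}
cluster_id = "<cluster-id>"

body = {
    "script": "#!/bin/bash\nsrun hostname",   # top level, not inside "job"
    "job": {
        "name": "hello",                       # illustrative
        "nodes": "1",                          # a string, not an integer
        "environment": ["PATH=/usr/bin:/bin"], # required
    },
}

resp = requests.post(
    f"{BASE}/v1/clusters/{cluster_id}/jobs/submit",
    headers=HEADERS,
    json=body,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```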
Templates
Clusterra ships a catalog of parameterized job templates. Call `GET /v1/templates` to list them and `GET /v1/templates/{id}` for the parameter schema. When you submit with a `template_id` and a `params` object, the central API renders the final script and submits it for you.
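In practice that is two reads and one write. A sketch reusing the placeholders above; the response shapes and the `input_file` param are assumptions, so take the real param names from the template's schema.

```python
import requests

# BASE, HEADERS, and cluster_id as in the earlier sketch.
templates = requests.get(f"{BASE}/v1/templates", headers=HEADERS, timeout=30).json()
tpl_id = templates[0]["id"]   # assumes the list returns objects with an "id"

# Fetch the parameter schema to see which params the template expects.
schema = requests.get(f"{BASE}/v1/templates/{tpl_id}", headers=HEADERS, timeout=30).json()

# Submit with template_id + params; the central API renders the final script.
resp = requests.post(
    f"{BASE}/v1/clusters/{cluster_id}/jobs/submit",
    headers=HEADERS,
    json={"template_id": tpl_id, "params": {"input_file": "data/system.pdb"}},
    timeout=30,
)
resp.raise_for_status()
```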
Streaming logs
Every job's stdout lands at `/mnt/efs/job_{job_id}.out` on the shared EFS mount. Clusterra exposes two endpoints to read it:

- `GET /v1/clusters/{id}/jobs/{job_id}/output` — the current full file.
- `GET /v1/clusters/{id}/jobs/{job_id}/output/stream` — a Server-Sent Events stream that tails the file and pushes new lines as they are written.
The console uses the streaming endpoint, which is why the log panel feels live rather than polled.
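You can consume the stream from a script, too. A minimal sketch, again assuming the earlier placeholders; it only handles `data:` lines, where a full SSE client would also handle event ids, retries, and multi-line data fields.

```python
import requests

# BASE, HEADERS, and cluster_id as above; job_id comes from the submit response.
url = f"{BASE}/v1/clusters/{cluster_id}/jobs/{job_id}/output/stream"
with requests.get(url, headers=HEADERS, stream=True, timeout=(5, None)) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            print(line[len("data:"):].strip())
```

The same pattern works for the per-cluster events feed described under Events.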
Files
Each cluster has an S3 bucket in your account. The console's Files tab is a thin UI over it. File keys are namespaced by location:
- `scripts/` — submit scripts you want to version.
- `data/` — inputs. Typically uploaded once, read many times.
- `outputs/` — where your jobs write results.
- `shared/` — read-only across all users in your workspace.
Uploads go through a presigned PUT: `POST /v1/clusters/{id}/storage/presigned-url` returns a URL you upload to directly from the browser; the file never passes through Clusterra. Downloads work the same way with a presigned GET.
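The same flow works from a script. A sketch assuming the earlier placeholders; the request-body field names (`key`, `method`) are guesses at the shape, so check the API reference.

```python
import requests

# BASE, HEADERS, and cluster_id as in the earlier sketches.
presign = requests.post(
    f"{BASE}/v1/clusters/{cluster_id}/storage/presigned-url",
    headers=HEADERS,
    json={"key": "data/inputs.tar.gz", "method": "PUT"},  # assumed field names
    timeout=30,
)
presign.raise_for_status()

with open("inputs.tar.gz", "rb") as f:
    # The PUT goes straight to S3; no Clusterra auth header is needed.
    requests.put(presign.json()["url"], data=f, timeout=300).raise_for_status()
```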
Everything under `scripts/`, `data/`, and `outputs/` is scoped to your email prefix — you can't read other users' files, and they can't read yours. Only `shared/` is cross-user, and only an admin can upload there.
Autoscaling
Clusterra maps each pending job to a slurmd shape based on its RAM/vCPU ratio:
| Family  | RAM per vCPU | Example EC2 |
|---------|--------------|-------------|
| compute | ≤ 3 GiB      | c6g, c7g    |
| general | ≤ 6 GiB      | m6g, m7g    |
| memory  | > 6 GiB      | r6g, r7g    |
Within a family we pick a size (`xs`, `small`, `medium`, `large`, `xlarge`) based on the biggest pending job. The scaler writes desired replicas to DynamoDB; the edge-agent patches the matching slurmd Deployment; Karpenter brings the EC2 node up; Slurm dispatches the job.
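The family decision reduces to a ratio check. An illustrative sketch of just that mapping; the size thresholds within a family aren't documented here, so they're omitted.

```python
def pick_family(ram_gib: float, vcpus: int) -> str:
    """Map a job's RAM/vCPU ratio to a slurmd family, per the table above."""
    ratio = ram_gib / vcpus
    if ratio <= 3:
        return "compute"   # c6g, c7g
    if ratio <= 6:
        return "general"   # m6g, m7g
    return "memory"        # r6g, r7g

assert pick_family(ram_gib=4, vcpus=2) == "compute"   # 2 GiB per vCPU
assert pick_family(ram_gib=64, vcpus=8) == "memory"   # 8 GiB per vCPU
```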
When the queue empties, Karpenter scales nodes down after a short idle window. There is no idle fleet.
Cost and quota
Every poll cycle the central API calculates your current burn rate (the sum of per-node hourly rates across running jobs) and pending bill (the sum of elapsed seconds × rate). Both are on your profile at `GET /v1/users/me` and visible in the console's usage page.
Workspace admins can set per-user quotas. The default enforcement mode warns at the limit; `block_and_kill` cancels running jobs when projected spend exceeds limit × (1 + buffer%).
Cancellations use the standard `DELETE /v1/clusters/{id}/jobs/{job_id}` path, so they show up in event history like any other cancel.
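As a back-of-the-envelope check of the arithmetic, with made-up numbers and assuming rates are dollars per node-hour (so elapsed seconds get converted to hours):

```python
# Two running jobs, with assumed per-node hourly rates.
running = [
    {"nodes": 2, "rate_per_hour": 0.50, "elapsed_s": 7200},
    {"nodes": 1, "rate_per_hour": 1.20, "elapsed_s": 1800},
]

# Burn rate: sum of per-node hourly rates across running jobs.
burn_rate = sum(j["nodes"] * j["rate_per_hour"] for j in running)   # $2.20/hr

# Pending bill: elapsed seconds × rate, hourly rate converted to seconds.
pending = sum(j["elapsed_s"] * j["nodes"] * j["rate_per_hour"] / 3600
              for j in running)                                     # $2.60

# block_and_kill fires when projected spend exceeds limit × (1 + buffer%).
limit, buffer_pct = 100.00, 0.10
threshold = limit * (1 + buffer_pct)                                # $110.00
```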
The agent
Ask AI in the console opens a chat panel tied to the same cluster-api service. The agent has access to:
- A set of domain docs (Slurm, GROMACS, AMBER, Nextflow) loaded at service startup.
- Your current cluster state — running jobs, node shapes, pending queue — through internal tool calls.
- Your private memory store. When you ask it to remember something (“my default partition is slurm-workers”) it saves a memory you can list or delete at `/v1/agent/memories`; see the sketch after this list.
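A sketch of working with memories, reusing the earlier placeholders; the delete-by-id path is an assumption about the API shape rather than something this guide confirms.

```python
import requests

# BASE and HEADERS as in the earlier sketches.
memories = requests.get(f"{BASE}/v1/agent/memories", headers=HEADERS, timeout=30).json()
for m in memories:
    print(m)

# Deleting a single memory by id is assumed, not confirmed:
# requests.delete(f"{BASE}/v1/agent/memories/{m['id']}", headers=HEADERS, timeout=30)
```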
It can also be asked to do things — submit a job, cancel one, fetch a schema. Destructive actions (anything that submits, cancels, or deletes) require you to confirm in the UI before they run.
Events
Every scaling decision, job state transition, and admin action is written to a per-cluster event stream. Read it with `GET /v1/clusters/{id}/events` or subscribe to `GET /v1/clusters/{id}/events/stream` for a live SSE feed. The console's activity panel is the subscribed view.
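Subscribing from a script follows the same SSE pattern as the log tail above. A minimal sketch with the same placeholders:

```python
import requests

# BASE, HEADERS, and cluster_id as in the earlier sketches.
url = f"{BASE}/v1/clusters/{cluster_id}/events/stream"
with requests.get(url, headers=HEADERS, stream=True, timeout=(5, None)) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            print(line[len("data:"):].strip())
```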