User guide

A tour of the things you actually do day-to-day: submit jobs, stream logs, manage files, understand scaling, and stay under quota.

Submitting jobs

You can submit jobs three ways, all of which land in the same slurmctld.

The submit body

Raw submissions use the slurmrestd v0.0.44 shape. Two details of that shape tend to trip people up.
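
A minimal raw-submission sketch, assuming bearer-token auth and a POST /v1/clusters/{id}/jobs endpoint (the guide doesn't name the submit path, so treat it as hypothetical). The body follows the slurmrestd submit shape; check exact field names against the v0.0.44 schema.

```python
import requests

BASE = "https://api.clusterra.example.com"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer ..."}    # assumed auth scheme
CLUSTER = "clu_123"                          # hypothetical cluster id

# slurmrestd-style submit body: a batch script plus job properties.
body = {
    "script": "#!/bin/bash\nsrun hostname\n",
    "job": {
        "name": "hello",
        "current_working_directory": "/tmp",
        # slurmrestd does not inherit your shell env; set it explicitly.
        "environment": ["PATH=/usr/bin:/bin"],
        "tasks": 1,
    },
}

resp = requests.post(f"{BASE}/v1/clusters/{CLUSTER}/jobs",
                     json=body, headers=HEADERS)
resp.raise_for_status()
print(resp.json())   # the response normally carries the new job_id
```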

Templates

Clusterra ships a catalog of parameterized job templates. Call GET /v1/templates to list them and GET /v1/templates/{id} for the parameter schema. When you submit with a template_id and a params object, the central API renders the final script and submits it for you.
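
A sketch of the template flow end to end. The two GET endpoints are as named above; the submit path, the shape of the list response, and the example params are assumptions.

```python
import requests

BASE = "https://api.clusterra.example.com"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer ..."}

# 1. List the catalog and fetch one template's parameter schema.
#    (Assumes the list response is an array of objects with an "id".)
templates = requests.get(f"{BASE}/v1/templates", headers=HEADERS).json()
template_id = templates[0]["id"]
schema = requests.get(f"{BASE}/v1/templates/{template_id}",
                      headers=HEADERS).json()

# 2. Submit with template_id + params; the central API renders the
#    final script and submits it for you.
resp = requests.post(
    f"{BASE}/v1/clusters/clu_123/jobs",              # hypothetical submit path
    json={"template_id": template_id,
          "params": {"input_file": "data/in.csv"}},  # params per the schema
    headers=HEADERS,
)
print(resp.json())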

Streaming logs

Every job's stdout lands at /mnt/efs/job_{job_id}.out on the shared EFS mount. Clusterra exposes two endpoints to read it: one for a one-shot read and one that streams.

The console uses the streaming endpoint, which is why the log panel feels live rather than polled.
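
The guide doesn't name the two log endpoints, so the sketch below assumes a hypothetical GET /v1/clusters/{id}/jobs/{job_id}/logs/stream SSE feed, mirroring the events endpoints described later in this guide.

```python
import requests

BASE = "https://api.clusterra.example.com"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer ..."}
url = f"{BASE}/v1/clusters/clu_123/jobs/42/logs/stream"  # assumed path

with requests.get(url, headers=HEADERS, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE frames arrive as "data: <chunk>" lines separated by blanks.
        if line and line.startswith("data:"):
            print(line[len("data:"):].strip())
```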

Files

Each cluster has an S3 bucket in your account. The console's Files tab is a thin UI over it. File keys are namespaced by location: scripts/, data/, outputs/, and shared/.

Uploads go through a presigned PUT: POST /v1/clusters/{id}/storage/presigned-url returns a URL you upload to directly from the browser; the file never passes through Clusterra. Downloads work the same way with a presigned GET.
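
A sketch of the upload flow. The presigned-url endpoint is as named above; the request fields (key, method) and the response field (url) are assumptions about the payload shape.

```python
import requests

BASE = "https://api.clusterra.example.com"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer ..."}

# 1. Ask Clusterra to presign a PUT for the key you want to write.
grant = requests.post(
    f"{BASE}/v1/clusters/clu_123/storage/presigned-url",
    json={"key": "scripts/train.sh", "method": "PUT"},  # assumed fields
    headers=HEADERS,
).json()

# 2. Upload straight to S3; the file never passes through Clusterra.
with open("train.sh", "rb") as f:
    requests.put(grant["url"], data=f).raise_for_status()
```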

Everything under scripts/, data/, and outputs/ is scoped to your email prefix — you can't read other users' files, and they can't read yours. Only shared/ is cross-user, and only an admin can upload there.

Autoscaling

Clusterra maps each pending job to a slurmd shape based on its RAM/vCPU ratio:

Family     RAM per vCPU   Example EC2
compute    ≤ 3 GiB        c6g, c7g
general    ≤ 6 GiB        m6g, m7g
memory     > 6 GiB        r6g, r7g

Within a family we pick a size (xs, small, medium, large, xlarge) based on the biggest pending job. The scaler writes desired replicas to DynamoDB; the edge-agent patches the matching slurmd Deployment; Karpenter brings the EC2 node up; Slurm dispatches the job.
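
The family rule is simple enough to sketch from the table above. This is an illustration of the thresholds, not Clusterra's scaler code; the size cut-offs within a family aren't documented here, so only family selection is shown.

```python
def pick_family(mem_gib: float, vcpus: int) -> str:
    """Map a pending job's RAM/vCPU ratio to a slurmd family."""
    ratio = mem_gib / vcpus
    if ratio <= 3:
        return "compute"   # c6g / c7g class
    if ratio <= 6:
        return "general"   # m6g / m7g class
    return "memory"        # r6g / r7g class

assert pick_family(mem_gib=8, vcpus=4) == "compute"   # 2 GiB per vCPU
assert pick_family(mem_gib=16, vcpus=4) == "general"  # 4 GiB per vCPU
assert pick_family(mem_gib=64, vcpus=4) == "memory"   # 16 GiB per vCPU
```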

When the queue empties, Karpenter scales nodes down after a short idle window. There is no idle fleet.

Cold-start time. The first job on a fresh cluster typically starts in 60–120 seconds: Karpenter node provisioning (~30 s) + Cilium coming up on the node (~30 s) + slurmd registration with slurmctld (~20 s). Subsequent jobs that fit on an already-warm node start in under a second.

Cost and quota

Every poll cycle the central API calculates your current burn rate (sum of per-node hourly rate across running jobs) and pending bill (sum of elapsed seconds × rate). Both are on your profile at GET /v1/users/me and visible in the console's usage page.
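
As a rough sketch, the two numbers reduce to a pair of sums. The per-job field names here are hypothetical, and "elapsed seconds × rate" is read as an hourly rate prorated per second.

```python
import time

def burn_rate(running_jobs) -> float:
    """Current $/hour: sum of per-node hourly rates across running jobs."""
    return sum(j["node_hourly_rate"] for j in running_jobs)

def pending_bill(running_jobs, now=None) -> float:
    """Accrued so far: elapsed seconds x hourly rate / 3600, summed."""
    now = now or time.time()
    return sum((now - j["started_at"]) * j["node_hourly_rate"] / 3600
               for j in running_jobs)
```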

Workspace admins can set per-user quotas. The default enforcement mode warns at the limit; block_and_kill cancels running jobs when projected spend exceeds limit × (1 + buffer%). Cancellations use the standard DELETE /v1/clusters/{id}/jobs/{job_id} path, so they show up in event history like any other cancel.
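
The block_and_kill trigger is just the stated formula. A minimal sketch, with the buffer expressed as a percentage:

```python
def should_kill(projected_spend: float, limit: float, buffer_pct: float) -> bool:
    """True when projected spend exceeds limit x (1 + buffer%)."""
    return projected_spend > limit * (1 + buffer_pct / 100)

# A $100 limit with a 10% buffer cancels jobs once projected spend passes $110.
assert should_kill(111.0, limit=100.0, buffer_pct=10.0)
assert not should_kill(105.0, limit=100.0, buffer_pct=10.0)
```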

The agent

Ask AI in the console opens a chat panel tied to the same cluster-api service. The agent has access to:

It can also be asked to do things — submit a job, cancel one, fetch a schema. Destructive actions (anything that submits, cancels, or deletes) require you to confirm in the UI before they run.

Events

Every scaling decision, job state transition, and admin action is written to a per-cluster event stream. Read it with GET /v1/clusters/{id}/events or subscribe to GET /v1/clusters/{id}/events/stream for a live SSE feed. The console's activity panel is the subscribed view.
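
A minimal subscriber for the SSE feed named above, assuming bearer-token auth, a hypothetical base URL, and JSON event payloads.

```python
import json
import requests

BASE = "https://api.clusterra.example.com"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer ..."}

with requests.get(f"{BASE}/v1/clusters/clu_123/events/stream",
                  headers=HEADERS, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            event = json.loads(line[len("data:"):])
            print(event)  # scaling decisions, job transitions, admin actions
```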