User guide

A tour of the things you actually do day-to-day: submit jobs, stream logs, manage files, understand scaling, and stay under quota.

Submitting jobs

You can submit jobs three ways, all of which land in the same slurmctld.

The submit body

Raw submissions use the slurmrestd v0.0.44 shape. Two details of that shape tend to trip people up.
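
A minimal raw-submission sketch, assuming bearer-token auth and a POST /v1/clusters/{id}/jobs endpoint (the guide doesn't name the submit path, so treat it as hypothetical). The body follows the slurmrestd submit shape; check exact field names against the v0.0.44 schema.

```python
import requests

BASE = "https://api.clusterra.example.com"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer ..."}    # assumed auth scheme
CLUSTER = "clu_123"                          # hypothetical cluster id

# slurmrestd-style submit body: a batch script plus job properties.
body = {
    "script": "#!/bin/bash\nsrun hostname\n",
    "job": {
        "name": "hello",
        "current_working_directory": "/tmp",
        # slurmrestd does not inherit your shell env; set it explicitly.
        "environment": ["PATH=/usr/bin:/bin"],
        "tasks": 1,
    },
}

resp = requests.post(f"{BASE}/v1/clusters/{CLUSTER}/jobs",
                     json=body, headers=HEADERS)
resp.raise_for_status()
print(resp.json())   # the response normally carries the new job_id
```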

Templates

Clusterra ships a catalog of parameterized job templates. Call GET /v1/templates to list them and GET /v1/templates/{id} for the parameter schema. When you submit with a template_id and a params object, the central API renders the final script and submits it for you.
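
A sketch of the template flow end to end. The two GET endpoints are as named above; the submit path, the shape of the list response, and the example params are assumptions.

```python
import requests

BASE = "https://api.clusterra.example.com"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer ..."}

# 1. List the catalog and fetch one template's parameter schema.
#    (Assumes the list response is an array of objects with an "id".)
templates = requests.get(f"{BASE}/v1/templates", headers=HEADERS).json()
template_id = templates[0]["id"]
schema = requests.get(f"{BASE}/v1/templates/{template_id}",
                      headers=HEADERS).json()

# 2. Submit with template_id + params; the central API renders the
#    final script and submits it for you.
resp = requests.post(
    f"{BASE}/v1/clusters/clu_123/jobs",              # hypothetical submit path
    json={"template_id": template_id,
          "params": {"input_file": "data/in.csv"}},  # params per the schema
    headers=HEADERS,
)
print(resp.json())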

Streaming logs

Every job's stdout lands at /mnt/efs/job_{job_id}.out on the shared EFS mount. Clusterra exposes two endpoints to read it: one for a one-shot read and one that streams.

The console uses the streaming endpoint, which is why the log panel feels live rather than polled.
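
The guide doesn't name the two log endpoints, so the sketch below assumes a hypothetical GET /v1/clusters/{id}/jobs/{job_id}/logs/stream SSE feed, mirroring the events endpoints described later in this guide.

```python
import requests

BASE = "https://api.clusterra.example.com"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer ..."}
url = f"{BASE}/v1/clusters/clu_123/jobs/42/logs/stream"  # assumed path

with requests.get(url, headers=HEADERS, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE frames arrive as "data: <chunk>" lines separated by blanks.
        if line and line.startswith("data:"):
            print(line[len("data:"):].strip())
```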

Files

Each cluster has an S3 bucket in your account. The console's Files tab is a thin UI over it. File keys are namespaced by location: scripts/, data/, outputs/, and shared/.

Uploads go through a presigned PUT: POST /v1/clusters/{id}/storage/presigned-url returns a URL you upload to directly from the browser; the file never passes through Clusterra. Downloads work the same way with a presigned GET.
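
A sketch of the upload flow. The presigned-url endpoint is as named above; the request fields (key, method) and the response field (url) are assumptions about the payload shape.

```python
import requests

BASE = "https://api.clusterra.example.com"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer ..."}

# 1. Ask Clusterra to presign a PUT for the key you want to write.
grant = requests.post(
    f"{BASE}/v1/clusters/clu_123/storage/presigned-url",
    json={"key": "scripts/train.sh", "method": "PUT"},  # assumed fields
    headers=HEADERS,
).json()

# 2. Upload straight to S3; the file never passes through Clusterra.
with open("train.sh", "rb") as f:
    requests.put(grant["url"], data=f).raise_for_status()
```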

Everything under scripts/, data/, and outputs/ is scoped to your email prefix — you can't read other users' files, and they can't read yours. Only shared/ is cross-user, and only an admin can upload there.

Autoscaling

Clusterra maps each pending job to a slurmd shape based on its RAM/vCPU ratio:

Family     RAM per vCPU   Example EC2
compute    ≤ 3 GiB        c6g, c7g
general    ≤ 6 GiB        m6g, m7g
memory     > 6 GiB        r6g, r7g

Within a family we pick a size (xs, small, medium, large, xlarge) based on the biggest pending job. The scaler writes desired replicas to DynamoDB; the edge-agent patches the matching slurmd Deployment; Karpenter brings the EC2 node up; Slurm dispatches the job.
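
The family rule is simple enough to sketch from the table above. This is an illustration of the thresholds, not Clusterra's scaler code; the size cut-offs within a family aren't documented here, so only family selection is shown.

```python
def pick_family(mem_gib: float, vcpus: int) -> str:
    """Map a pending job's RAM/vCPU ratio to a slurmd family."""
    ratio = mem_gib / vcpus
    if ratio <= 3:
        return "compute"   # c6g / c7g class
    if ratio <= 6:
        return "general"   # m6g / m7g class
    return "memory"        # r6g / r7g class

assert pick_family(mem_gib=8, vcpus=4) == "compute"   # 2 GiB per vCPU
assert pick_family(mem_gib=16, vcpus=4) == "general"  # 4 GiB per vCPU
assert pick_family(mem_gib=64, vcpus=4) == "memory"   # 16 GiB per vCPU
```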

When the queue empties, Karpenter scales nodes down after a short idle window. There is no idle fleet.

Cold-start time. The first job on a fresh cluster typically starts in 60–120 seconds: Karpenter node provisioning (~30 s) + Cilium coming up on the node (~30 s) + slurmd registration with slurmctld (~20 s). Subsequent jobs that fit on an already-warm node start in under a second.

Cost and quota

Every poll cycle the central API calculates your current burn rate (sum of per-node hourly rate across running jobs) and pending bill (sum of elapsed seconds × rate). Both are on your profile at GET /v1/users/me and visible in the console's usage page.
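
As a rough sketch, the two numbers reduce to a pair of sums. The per-job field names here are hypothetical, and "elapsed seconds × rate" is read as an hourly rate prorated per second.

```python
import time

def burn_rate(running_jobs) -> float:
    """Current $/hour: sum of per-node hourly rates across running jobs."""
    return sum(j["node_hourly_rate"] for j in running_jobs)

def pending_bill(running_jobs, now=None) -> float:
    """Accrued so far: elapsed seconds x hourly rate / 3600, summed."""
    now = now or time.time()
    return sum((now - j["started_at"]) * j["node_hourly_rate"] / 3600
               for j in running_jobs)
```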

Workspace admins can set per-user quotas. The default enforcement mode warns at the limit; block_and_kill cancels running jobs when projected spend exceeds limit × (1 + buffer%). Cancellations use the standard DELETE /v1/clusters/{id}/jobs/{job_id} path, so they show up in event history like any other cancel.
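
The block_and_kill trigger is just the stated formula. A minimal sketch, with the buffer expressed as a percentage:

```python
def should_kill(projected_spend: float, limit: float, buffer_pct: float) -> bool:
    """True when projected spend exceeds limit x (1 + buffer%)."""
    return projected_spend > limit * (1 + buffer_pct / 100)

# A $100 limit with a 10% buffer cancels jobs once projected spend passes $110.
assert should_kill(111.0, limit=100.0, buffer_pct=10.0)
assert not should_kill(105.0, limit=100.0, buffer_pct=10.0)
```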

The agent

Ask AI in the console opens a chat panel tied to the same cluster-api service. The agent has access to:

It can also be asked to do things — submit a job, cancel one, fetch a schema. Destructive actions (anything that submits, cancels, or deletes) require you to confirm in the UI before they run.

Events

Every scaling decision, job state transition, and admin action is written to a per-cluster event stream. Read it with GET /v1/clusters/{id}/events or subscribe to GET /v1/clusters/{id}/events/stream for a live SSE feed. The console's activity panel is the subscribed view.
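
A minimal subscriber for the SSE feed named above, assuming bearer-token auth, a hypothetical base URL, and JSON event payloads.

```python
import json
import requests

BASE = "https://api.clusterra.example.com"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer ..."}

with requests.get(f"{BASE}/v1/clusters/clu_123/events/stream",
                  headers=HEADERS, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            event = json.loads(line[len("data:"):])
            print(event)  # scaling decisions, job transitions, admin actions
```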