2026-01-17

Clusterra Architecture: What Runs in Your AWS Account vs Ours

When teams evaluate Clusterra, one of the first questions is understandably about boundaries: What runs in our AWS account, what runs in yours, and how do they connect?

Clusterra is built around a split architecture that keeps compute, scheduling, and data entirely in the customer’s AWS account, while providing a centralized control plane for identity, coordination, and visibility.

This post explains that split and how the two sides communicate.

Figure: Clusterra architecture diagram

What Runs in Your AWS Account

All infrastructure that actually runs jobs or touches data lives in the customer’s AWS account.

This includes:

Slurm clusters provisioned using AWS ParallelCluster (head node, compute nodes, schedulers, autoscaling)
Slurmrestd, running on the head node
All job execution and scheduling
All storage (FSx for Lustre, EBS, S3, etc.)
Customer-owned IAM roles, VPCs, and networking

Clusterra is deployed into the customer account using infrastructure-as-code (OpenTofu / ParallelCluster). There is no requirement to expose head nodes publicly or allow inbound access from the internet.

Clusterra does not:

execute jobs on your behalf
access job payloads or file systems
require SSH access to your cluster nodes

What Runs in Clusterra’s AWS Account

Clusterra operates a centralized control plane that provides coordination across users and clusters.

This includes:

Web console
Public APIs used by the console, CLI, and integrations
User authentication and identity (OIDC with Okta / Entra ID)
Customer configuration metadata
Event processing and delivery
Cost and usage aggregation

The control plane does not have direct network access to customer clusters or storage; it reaches slurmrestd only through the private bridge described in the next section.

How Clusterra Connects to Your Cluster

Clusterra interacts with customer clusters through a serverless networking layer built on AWS VPC Lattice.

At a high level:

Requests from users hit the Clusterra control plane
A Clusterra-managed Lambda (the bridge) is invoked
The bridge connects to the customer cluster privately via AWS VPC Lattice
All communication terminates at slurmrestd on the head node
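To make the request path concrete, here is a minimal sketch of what a bridge function of this shape might look like. It is an illustration under stated assumptions, not Clusterra's implementation: the base URL, the API version prefix, the event fields (`user`, `slurm_jwt`), and the function names are all hypothetical. The two `X-SLURM-USER-*` headers are the ones slurmrestd expects for JWT authentication.

```python
import json
import urllib.request
from typing import Callable, Dict

# Hypothetical values: the real address would resolve through VPC Lattice,
# and the API version prefix depends on the Slurm release in use.
SLURMRESTD_BASE = "http://slurmrestd.example.internal:6820"
API_PREFIX = "/slurm/v0.0.40"


def slurm_headers(user: str, token: str) -> Dict[str, str]:
    """Headers slurmrestd uses to authenticate a JWT-bearing request."""
    return {
        "X-SLURM-USER-NAME": user,
        "X-SLURM-USER-TOKEN": token,
        "Accept": "application/json",
    }


def http_get(url: str, headers: Dict[str, str]) -> dict:
    """Perform the GET and parse the JSON response body."""
    req = urllib.request.Request(url, headers=headers, method="GET")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def bridge_handler(event: dict, context=None,
                   fetch: Callable[[str, Dict[str, str]], dict] = http_get) -> dict:
    """Lambda-style entry point: forward a 'list jobs' call to slurmrestd.

    `fetch` is injectable so the handler can be exercised without a cluster.
    """
    url = f"{SLURMRESTD_BASE}{API_PREFIX}/jobs"
    body = fetch(url, slurm_headers(event["user"], event["slurm_jwt"]))
    return {"statusCode": 200, "body": json.dumps(body)}
```

The point of the sketch is the narrow interface: the bridge only ever speaks HTTP to slurmrestd, and the only credential it carries is the per-request token.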

For reporting status back to Clusterra (Upstream):

The cluster uses a hybrid, agentless approach to push events:

1. Job state changes are sent directly to the Clusterra API via lightweight non-blocking HTTP hooks (using standard curl &).
2. Infrastructure events (like node scaling) are captured by CloudWatch and routed securely via Amazon EventBridge.

This architecture removes the need for polling agents, long-running daemons, or complex message queues in your account.

There is no public endpoint on the cluster, no VPN management, and no inbound access from the internet.
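As a rough illustration of the job-state hook described above, the sketch below builds a small metadata-only payload and hands it to curl in a detached process, which has the same effect as `curl &` in a shell hook: the scheduler is never blocked on the network call. The endpoint URL, the payload fields, and the `JOB_STATE` variable are assumptions made for the example; the `SLURM_*` names follow the variables Slurm sets for prolog and epilog scripts, but treat the exact set available in a given hook as something to verify.

```python
import json
import subprocess
from typing import Dict, List

# Hypothetical endpoint; the real Clusterra API URL is not given in this post.
CLUSTERRA_EVENTS_URL = "https://api.example.invalid/v1/events"


def build_job_event(env: Dict[str, str]) -> dict:
    """Collect job metadata from the hook's environment (no payload data)."""
    return {
        "type": "job.state_changed",
        "cluster": env.get("SLURM_CLUSTER_NAME", "unknown"),
        "job_id": env.get("SLURM_JOB_ID", ""),
        "user": env.get("SLURM_JOB_USER", ""),
        "state": env.get("JOB_STATE", "UNKNOWN"),  # hypothetical variable
    }


def curl_command(event: dict, token: str,
                 url: str = CLUSTERRA_EVENTS_URL) -> List[str]:
    """Build the curl argument list for a short, bounded POST."""
    return [
        "curl", "--silent", "--max-time", "5",
        "-X", "POST",
        "-H", "Content-Type: application/json",
        "-H", f"Authorization: Bearer {token}",
        "--data", json.dumps(event),
        url,
    ]


def send_nonblocking(cmd: List[str]) -> subprocess.Popen:
    """Start curl in its own session and return immediately."""
    return subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )
```

A hook script would call `send_nonblocking(curl_command(build_job_event(dict(os.environ)), token))` and exit. If the API is unreachable, curl times out on its own and the job is unaffected.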

Authentication and Authorization

Clusterra does not manage SSH keys or static Linux users.

Instead:

Users authenticate using customer-managed OIDC (Okta / Entra ID)
Requests are translated into short-lived Slurm JWTs
slurmrestd validates these tokens and forwards requests to slurmctld

This model avoids:

distributing SSH keys
maintaining long-lived service credentials
manual user provisioning on cluster nodes

Access is enforced at the Slurm API layer, not via shell access to the head node.

Events and State Visibility

Clusterra treats cluster state as event-driven.

The platform emits structured events for:

Job lifecycle (submitted, pending, running, completed, failed)
Node lifecycle (provisioning, active, drained, terminated)
User activity

These events enable integrations such as:

Slack notifications
CI/CD hooks
Automated workflows around job completion or failure
Cost and usage alerts

Events are metadata-only; job inputs and outputs remain in the customer account.
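To show how an integration might consume these events, here is a small sketch of a metadata-only event record and a router that triggers handlers on specific lifecycle states. The field names, the `ClusterEvent` and `EventRouter` classes, and the state strings are hypothetical; Clusterra's actual event schema is not described in this post.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ClusterEvent:
    """Metadata about a state change; carries no job inputs or outputs."""
    kind: str        # "job", "node", or "user"
    state: str       # e.g. "failed", "drained"
    cluster: str
    subject_id: str  # job ID, node name, or user name
    timestamp: str   # ISO 8601, UTC


Handler = Callable[[ClusterEvent], None]


class EventRouter:
    """Dispatch events to handlers registered for a (kind, state) pair."""

    def __init__(self) -> None:
        self._handlers: Dict[Tuple[str, str], List[Handler]] = {}

    def on(self, kind: str, state: str, handler: Handler) -> None:
        self._handlers.setdefault((kind, state), []).append(handler)

    def dispatch(self, event: ClusterEvent) -> int:
        """Run matching handlers and return how many were invoked."""
        handlers = self._handlers.get((event.kind, event.state), [])
        for handler in handlers:
            handler(event)
        return len(handlers)
```

A Slack notifier, for instance, would register a handler with `router.on("job", "failed", notify)` and format a message from the event's cluster and job ID, without ever seeing the job's data.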

Why This Architecture Matters

This separation is intentional and conservative.

It enables:

Clear security boundaries: No SSH access, no root privileges, no broad cross-account trust.
Data sovereignty: Compute and data never leave the customer’s AWS account.
Operational safety: Centralized visibility without centralizing execution.
Easier security reviews: Narrow interfaces are easier to reason about than full cluster access.

Summary

Clusterra centralizes control and visibility, not compute.

Your AWS account runs Slurm, jobs, and storage
Clusterra runs identity, APIs, events, and coordination
The two connect privately through a scoped bridge to slurmrestd

This model allows teams to operate Slurm clusters safely at scale without changing how jobs are run or who owns the infrastructure.

Footnote: Clusterra is built by the former Product Manager for AWS Batch and AWS Parallel Computing Service, informed by operating large-scale HPC systems in production.