2026-01-25

What Clusterra Deploys in Your AWS Account: Components, Costs, and Permissions

A detailed breakdown of every AWS resource Clusterra deploys in your account, what permissions they require, and how much they cost. For AWS account owners who want to know exactly what they're running.

When you connect a ParallelCluster to Clusterra, we deploy a small set of AWS resources into your account. This post is a complete inventory — every resource, every permission, every cost.

No hand-waving. If you're the AWS account owner reviewing this for security approval, this is your reference.

TL;DR: The Complete Inventory

| Resource | Purpose | Monthly Cost Estimate |
|---|---|---|
| VPC Lattice Service | Exposes slurmrestd to Clusterra | ~$18 + $0.025/GB |
| SQS Queue | Receives job/node events | ~$0.40/million requests |
| SQS Dead Letter Queue | Captures failed messages | ~$0.40/million requests |
| Lambda Function | Ships events to Clusterra API | ~$0.20/million invocations |
| CloudWatch Event Rules (3) | Capture EC2/ASG events | Free |
| IAM Roles (3) | Permissions for the resources above | Free |

Total estimated cost: $18-22/month for a typical cluster with moderate job volume.
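
To sanity-check that range against your own workload, the arithmetic can be sketched in a few lines of Python. The prices are the ones quoted in this post; the function name and the example workload numbers are illustrative assumptions, not part of Clusterra's tooling, and actual AWS pricing varies by region.

```python
# Prices copied from this post; verify against current AWS pricing for your region.
LATTICE_HOURLY_USD = 0.025         # VPC Lattice service, per hour
LATTICE_PER_GB_USD = 0.025         # VPC Lattice data processing
SQS_PER_MILLION_USD = 0.40         # after the first 1M free requests
SQS_FREE_REQUESTS = 1_000_000
LAMBDA_PER_MILLION_USD = 0.20      # per million invocations
LAMBDA_GB_SECOND_USD = 0.0000166667
HOURS_PER_MONTH = 730


def estimate_monthly_cost(gb_processed, sqs_requests, lambda_invocations,
                          lambda_memory_mb=128, lambda_seconds=0.5):
    """Return a per-service cost breakdown in USD for one month."""
    lattice = LATTICE_HOURLY_USD * HOURS_PER_MONTH + LATTICE_PER_GB_USD * gb_processed
    billable_sqs = max(0, sqs_requests - SQS_FREE_REQUESTS)
    sqs = billable_sqs / 1_000_000 * SQS_PER_MILLION_USD
    gb_seconds = (lambda_memory_mb / 1024) * lambda_seconds * lambda_invocations
    lam = (lambda_invocations / 1_000_000 * LAMBDA_PER_MILLION_USD
           + gb_seconds * LAMBDA_GB_SECOND_USD)
    return {"vpc_lattice": lattice, "sqs": sqs, "lambda": lam,
            "total": lattice + sqs + lam}


if __name__ == "__main__":
    # Hypothetical workload: 10 GB of API traffic, 50K SQS requests, 10K invocations.
    for name, usd in estimate_monthly_cost(10, 50_000, 10_000).items():
        print(f"{name:12s} ${usd:.2f}")
```

The fixed VPC Lattice hourly charge dominates; the event pipeline only becomes noticeable at millions of events per month.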

Architecture Overview

(Architecture diagram pending update to reflect VPC Lattice integration)


Resource #1: VPC Lattice Service & Target Group

What it does: Exposes slurmrestd on your head node to the Clusterra control plane securely.

Why it's needed: Replaces complex NLB + PrivateLink setups. VPC Lattice provides application-layer networking that connects Clusterra to your private cluster without exposing it to the internet.

resource "aws_vpclattice_service" "slurm_api" {
  name      = "clusterra-svc-${var.cluster_id}"
  auth_type = "NONE" # Auth handled by Slurm JWTs
}

resource "aws_vpclattice_target_group" "slurm_api" {
  name = "clusterra-tg-${var.cluster_id}"
  type = "INSTANCE"

  config {
    port           = 6830 # slurmrestd port on the head node
    vpc_identifier = var.vpc_id
  }
}

Cost breakdown:
- Service: ~$0.025/hour × 730 hours = ~$18.25/month (approximate; varies by region)
- Data processing: ~$0.025/GB

Security notes:
- No public IP or internet gateway required
- Access controlled via IAM Auth Policies
- Traffic stays on the AWS backbone


Resource #3: SQS Queue

What it does: Receives events from Slurm hooks and CloudWatch.

Why it's needed: Decouples event generation from event shipping. Hooks write to SQS (fast, async), Lambda processes later.

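resource "aws_sqs_queue" "events" {
  name                       = "clusterra-events-${var.cluster_name}"
  visibility_timeout_seconds = 60
  message_retention_seconds  = 86400 # 1 day
  receive_wait_time_seconds  = 20    # Long polling

  # After 3 failed receives, move the message to the dead letter queue
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.events_dlq.arn
    maxReceiveCount     = 3
  })
}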

Cost breakdown:
- First 1M requests/month: Free
- After: $0.40/million requests
- Typical cluster with 10K jobs/month: < $1/month

Security notes:
- No public access
- Only Lambda and ParallelCluster instance roles can read/write
- Messages contain event metadata only, never job content


Resource #4: SQS Dead Letter Queue

What it does: Captures messages that fail processing after 3 attempts.

Why it's needed: Prevents event loss. Failed messages go here for debugging instead of being discarded.

resource "aws_sqs_queue" "events_dlq" {
  name                      = "clusterra-events-${var.cluster_name}-dlq"
  message_retention_seconds = 604800 # 7 days
}

Cost: Same pricing as the main queue. Typically negligible unless messages fail chronically.


Resource #5: Lambda Function

What it does: Reads from SQS, batches events, POSTs to Clusterra API.

Why it's needed: Serverless event shipper. No agent to install or maintain.

resource "aws_lambda_function" "event_shipper" {
  function_name = "clusterra-event-shipper-${var.cluster_name}"
  runtime       = "python3.11"
  handler       = "handler.handler"
  timeout       = 30
  memory_size   = 128

  environment {
    variables = {
      CLUSTER_ID        = var.cluster_id
      TENANT_ID         = var.tenant_id
      CLUSTERRA_API_URL = "https://api.clusterra.cloud"
    }
  }
}

Cost breakdown:
- Requests: $0.20/million invocations
- Duration: $0.0000166667/GB-second
- Typical: 0.125 GB (128 MB) × 0.5 s × 10K invocations = 625 GB-seconds, or < $0.10/month

Security notes:
- Execution role has minimal permissions (receive and delete on the events queue, plus CloudWatch Logs writes)
- Only makes outbound HTTPS calls to api.clusterra.cloud
- No VPC attachment; API calls go over the public internet
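
For reviewers who want a feel for what such a shipper looks like, here is a minimal sketch of a handler with the shape described above (read the SQS batch, wrap it, POST it). It is not Clusterra's actual code: the `/v1/events` path, the payload layout, and the injectable `send` parameter are assumptions made for illustration, and authentication is omitted entirely.

```python
import json
import os
import urllib.request

API_URL = os.environ.get("CLUSTERRA_API_URL", "https://api.clusterra.cloud")
EVENTS_PATH = "/v1/events"  # hypothetical path, not documented in this post


def build_batch(sqs_event):
    """Collect the JSON bodies of all SQS records into one payload."""
    events = [json.loads(record["body"]) for record in sqs_event.get("Records", [])]
    return {
        "cluster_id": os.environ.get("CLUSTER_ID"),
        "tenant_id": os.environ.get("TENANT_ID"),
        "events": events,
    }


def post_json(url, payload, timeout=10):
    """POST a JSON payload; urlopen raises HTTPError on 4xx/5xx responses."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def handler(event, context, send=post_json):
    """Lambda entry point. An exception here leaves the batch in SQS for retry."""
    batch = build_batch(event)
    if not batch["events"]:
        return {"shipped": 0}
    send(API_URL + EVENTS_PATH, batch)
    return {"shipped": len(batch["events"])}
```

Letting exceptions propagate is deliberate in this sketch: with an SQS event source, a failed invocation makes the messages visible again, and repeated failures end up in the dead letter queue.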


Resource #6: CloudWatch Event Rules (3)

What it does: Captures EC2 and ASG events, forwards to SQS.

Why it's needed: Tracks head node state changes and compute node lifecycle without polling.

# Rule 1: EC2 instance state changes
resource "aws_cloudwatch_event_rule" "ec2_state" {
  name = "clusterra-ec2-state-${var.cluster_name}"
  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Instance State-change Notification"]
  })
}

# Rule 2: ASG launch/terminate
resource "aws_cloudwatch_event_rule" "asg_events" {
  name = "clusterra-asg-${var.cluster_name}"
  event_pattern = jsonencode({
    source      = ["aws.autoscaling"]
    detail-type = ["EC2 Instance Launch Successful", "EC2 Instance Terminate Successful"]
  })
}

# Rule 3: Spot interruptions
resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name = "clusterra-spot-${var.cluster_name}"
  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Spot Instance Interruption Warning"]
  })
}

Cost: Free. EventBridge does not charge for rules, or for events that AWS services publish to the default event bus.
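
As an illustration of how these three rules could feed the event types listed later in this post, the sketch below classifies a raw EventBridge event. The mapping itself and the `head_node_id` parameter are assumptions; the real shipper may classify events differently. The `detail` field names follow the standard EC2 and Auto Scaling event formats.

```python
# Assumed mapping from EventBridge detail-type (plus EC2 state) to the event
# names used in this post. Illustrative only.
ASG_EVENT_TYPES = {
    "EC2 Instance Launch Successful": "node.launched",
    "EC2 Instance Terminate Successful": "node.terminated",
}
HEAD_NODE_STATES = {
    "running": "cluster.state.started",
    "stopped": "cluster.state.stopped",
}


def classify(event, head_node_id):
    """Return a Clusterra event name for an EventBridge event, or None to skip it."""
    detail_type = event.get("detail-type")
    detail = event.get("detail", {})

    if detail_type in ASG_EVENT_TYPES:
        return ASG_EVENT_TYPES[detail_type]
    if detail_type == "EC2 Spot Instance Interruption Warning":
        return "node.spot_interrupted"
    if detail_type == "EC2 Instance State-change Notification":
        # Rule 1 has no instance filter, so keep only the head node here.
        if detail.get("instance-id") == head_node_id:
            return HEAD_NODE_STATES.get(detail.get("state"))
    return None
```

Because the state-change pattern shown above matches every EC2 instance in the region, some filtering step like the head node check is needed somewhere in the pipeline.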


IAM Permissions: What Clusterra Can Access

(Diagram pending update)

IAM Role #1: Cross-Account Role (for Clusterra)

This role allows Clusterra's API to call your slurmrestd:

{
  "Effect": "Allow",
  "Action": [
    "secretsmanager:GetSecretValue"
  ],
  "Resource": "arn:aws:secretsmanager:*:*:secret:${jwt_secret_name}"
}

What it CAN do:
- Read the JWT secret to authenticate with slurmrestd
- Nothing else

What it CANNOT do:
- Access EC2, S3, EFS, FSx, or any other resource
- SSH to any instance
- Read job scripts or outputs
- Access your VPC networking

IAM Role #2: Lambda Execution Role

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:*:*:clusterra-events-*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}

IAM Role #3: ParallelCluster Instance Role Addition

We add this policy to your existing ParallelCluster instance role:

{
  "Effect": "Allow",
  "Action": "sqs:SendMessage",
  "Resource": "arn:aws:sqs:*:*:clusterra-events-*"
}

This is the only change to your existing ParallelCluster setup.
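
If you want to spot-check statements like these during a review, a rough allow-check can be written in a few lines. This is a deliberately simplified sketch: it ignores Deny statements, conditions, NotAction/NotResource, and the case-insensitivity of IAM action names, and the ARNs are made up for illustration. For a real assessment, use the IAM policy simulator.

```python
from fnmatch import fnmatchcase


def as_list(value):
    """IAM allows Action/Resource to be a string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def allows(statements, action, resource):
    """Simplified check: does any Allow statement match this action and resource?"""
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        action_ok = any(fnmatchcase(action, a) for a in as_list(statement["Action"]))
        resource_ok = any(fnmatchcase(resource, r) for r in as_list(statement["Resource"]))
        if action_ok and resource_ok:
            return True
    return False


# The ParallelCluster instance role addition from this post.
INSTANCE_ROLE_ADDITION = [{
    "Effect": "Allow",
    "Action": "sqs:SendMessage",
    "Resource": "arn:aws:sqs:*:*:clusterra-events-*",
}]

# Hypothetical ARNs used only for illustration.
QUEUE = "arn:aws:sqs:us-east-1:123456789012:clusterra-events-demo"
OTHER = "arn:aws:sqs:us-east-1:123456789012:billing-queue"

print(allows(INSTANCE_ROLE_ADDITION, "sqs:SendMessage", QUEUE))     # True
print(allows(INSTANCE_ROLE_ADDITION, "sqs:ReceiveMessage", QUEUE))  # False
print(allows(INSTANCE_ROLE_ADDITION, "sqs:SendMessage", OTHER))     # False
```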


What Events We Collect

(Diagram pending update)

Events are metadata only — never job content, scripts, or outputs.

| Event Type | Data Collected |
|---|---|
| job.started | job_id, user, partition, node, timestamp |
| job.completed | job_id, exit_code, timestamp |
| job.failed | job_id, exit_code, state, timestamp |
| node.launched | instance_id, ASG name, timestamp |
| node.terminated | instance_id, timestamp |
| node.spot_interrupted | instance_id, action, timestamp |
| cluster.state.started | instance_id (head node), timestamp |
| cluster.state.stopped | instance_id, timestamp |
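
One way to enforce the metadata-only rule on your side is to check each event against the table before it is sent to SQS. The sketch below assumes events are flat JSON objects with a `type` key; that envelope and the exact key spellings (for example `asg_name`) are assumptions, since this post lists the fields but not the wire format.

```python
# Allowed fields per event type, transcribed from the table above.
# Key spellings such as "asg_name" and the "type" key are assumed.
ALLOWED_FIELDS = {
    "job.started": {"job_id", "user", "partition", "node", "timestamp"},
    "job.completed": {"job_id", "exit_code", "timestamp"},
    "job.failed": {"job_id", "exit_code", "state", "timestamp"},
    "node.launched": {"instance_id", "asg_name", "timestamp"},
    "node.terminated": {"instance_id", "timestamp"},
    "node.spot_interrupted": {"instance_id", "action", "timestamp"},
    "cluster.state.started": {"instance_id", "timestamp"},
    "cluster.state.stopped": {"instance_id", "timestamp"},
}


def unexpected_fields(event):
    """Return the fields in an event that the table does not list for its type."""
    event_type = event.get("type")
    if event_type not in ALLOWED_FIELDS:
        raise ValueError(f"unknown event type: {event_type!r}")
    return set(event) - ALLOWED_FIELDS[event_type] - {"type"}
```

An empty result means the event carries only the documented metadata; anything else (a `script` or `stdout` field, say) would be a reason to reject the event before it leaves your account.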

What We Do NOT Have Access To

To be explicit:

| Resource | Clusterra Access |
|---|---|
| Head node SSH | ❌ None |
| Compute node SSH | ❌ None |
| EFS/FSx filesystems | ❌ None |
| S3 buckets | ❌ None |
| Job scripts | ❌ None |
| Job outputs | ❌ None |
| VPC networking | ❌ None |
| EC2 instance control | ❌ None |

The only things Clusterra can do:
1. Call the slurmrestd API (via VPC Lattice + JWT)
2. Receive events you explicitly send to SQS


Deployment: One Terraform Apply

All resources are deployed via OpenTofu/Terraform:

# Clone and configure
git clone https://github.com/clusterra/clusterra-connect
cd clusterra-connect
cp terraform.tfvars.example terraform.tfvars
# Edit with your values

# Deploy
tofu init
tofu apply

The apply takes about 5 minutes and creates every resource listed above in your account.


Summary

| Category | Details |
|---|---|
| Resources created | VPC Lattice service and target group, 2 SQS queues, 1 Lambda function, 3 CloudWatch Event Rules, plus the IAM roles described above |
| Monthly cost | ~$18-22 |
| Permissions granted | SQS send/receive, Secrets Manager read (JWT only) |
| Data sent to Clusterra | Event metadata only |
| Data NOT accessible | SSH, filesystems, job content, EC2 control |

Every resource is tagged with ManagedBy: OpenTOFU for easy identification.


Questions about specific permissions or resources? Email security@clusterra.cloud — we're happy to provide detailed scoping for your security review.