Architecture

This section describes the production architecture (one region). Environment topology (dev / nonprod / production / Phase 2 IL) is covered separately — see the "Environments" section.

Pattern: Backend-for-Frontend (BFF)

  • Frontend — React SPA, pure UI shell. No secrets, no external API calls, no business logic.
  • Backend — NestJS BFF. Single gateway for all requests: auth, data fetching, validation, business rules, integration with external services.
  • Data flow: React SPA → Cloudflare → AWS ALB → NestJS BFF → PostgreSQL / WorkOS / S3 / external services

Production region architecture (one AWS region)

              Internet
                 │
                 ▼
┌────────────────────────────────────────────┐
│ Cloudflare (single edge layer)             │
│   • WAF (OWASP Core Ruleset + custom)      │
│   • DDoS L3/4/7 mitigation                 │
│   • Rate limit (per-IP, per-API-key)       │
│   • Bot management                         │
│   • CDN (300+ PoPs, static SPA assets)     │
│   • Proxied DNS (origin IP hidden)         │
│   • TLS 1.2+ Full Strict                   │
└────────────────┬───────────────────────────┘
                 │ (only Cloudflare IPs +
                 │  mTLS Authenticated Origin Pulls)
                 ▼
┌────────────────────────────────────────────┐
│ AWS Production Account (set-prod-eu)       │
│   ┌────────────────────────────────────┐   │
│   │ VPC (/16, 3 Availability Zones)    │   │
│   │                                    │   │
│   │ Public subnets (1 per AZ)          │   │
│   │   └─ ALB                           │   │
│   │      (SG: inbound = CF IPs only)   │   │
│   │                                    │   │
│   │ Private subnets — compute          │   │
│   │   ├─ Fargate: api (NestJS BFF)     │   │
│   │   ├─ Fargate: api-workers          │   │
│   │   ├─ Fargate: scanner-workers      │   │
│   │   └─ Fargate: ai-workers (Phase 2) │   │
│   │                                    │   │
│   │ Private subnets — data             │   │
│   │   ├─ RDS Postgres 16 Multi-AZ      │   │
│   │   │    + 1 Read Replica            │   │
│   │   │    + RDS Proxy (pooling)       │   │
│   │   ├─ ElastiCache Redis Multi-AZ    │   │
│   │   │    (sessions, BullMQ queues)   │   │
│   │   └─ Keycloak (Phase 2, AWS IL)    │   │
│   │                                    │   │
│   │ NAT Gateway per AZ                 │   │
│   │ VPC Endpoints (S3, ECR, Secrets    │   │
│   │   Manager, KMS, CloudWatch Logs)   │   │
│   └────────────────────────────────────┘   │
│                                            │
│ Managed / shared services                  │
│ ├─ S3 (KMS-encrypted, all buckets)         │
│ │  ├─ Evidence (versioning + Object Lock)  │
│ │  ├─ Documents / reports                  │
│ │  └─ Backups + CloudTrail archive         │
│ ├─ Secrets Manager (DB, JWT, API keys)     │
│ ├─ KMS (Customer-Managed Keys)             │
│ ├─ SSM Parameter Store (non-secret cfg)    │
│ ├─ SES (transactional email)               │
│ ├─ SNS (internal alerting fanout)          │
│ ├─ CloudTrail multi-region trail           │
│ │    → S3 with Object Lock                 │
│ ├─ AWS Config (resource compliance)        │
│ ├─ GuardDuty (threat detection)            │
│ ├─ VPC Flow Logs                           │
│ ├─ AWS Backup (cross-service orchestrate)  │
│ ├─ IAM (role-based groups + MFA)           │
│ └─ CloudWatch (transit; see Observability) │
└────────────────┬───────────────────────────┘
                 │
                 ▼
┌────────────────────────────────────────────┐
│ External (managed sub-processors)          │
│ ├─ WorkOS (auth, EU region)                │
│ ├─ Logz.io (observability + SIEM, EU)      │
│ └─ Customer-connected systems              │
│    (Okta, AWS, GCP, Azure, GitHub, etc.)   │
└────────────────────────────────────────────┘

Frontend deployment

  • React SPA built as static files → uploaded to private S3 bucket
  • Served via app.{domain} through Cloudflare → S3 (bucket policy restricts to Cloudflare)
  • Cloudflare CDN caches static assets at 300+ PoPs globally, so first paint is served from a nearby edge location
  • Cache invalidation via Cloudflare API on deploy
  • No CloudFront — Cloudflare handles all CDN duties

Backend deployment

NestJS BFF + workers run on ECS Fargate. Served via api.{domain} through Cloudflare → ALB → Fargate tasks.

Three Fargate task families (four with Phase 2), each scales independently:

| Task family | Holds | Scales on |
| --- | --- | --- |
| api | NestJS BFF | request count + p99 latency |
| api-workers | Notifications, reports, exports, audit log persistence, webhooks, tenant lifecycle, maintenance | BullMQ queue depth |
| scanner-workers | Security scans, evidence ingestion, agent ingestion, compliance computation (rule engine) | Queue depth, higher CPU/memory profile |
| ai-workers (Phase 2) | AI Assistant, embedding generation, RAG over evidence | Queue depth |

Deployment: rolling updates with circuit breaker + auto-rollback on health-check failure. No long-lived credentials in containers — IAM roles per task family.
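
How queue-depth scaling could translate into a task count can be sketched as a small pure function. This is a minimal illustration only: the `ScalingPolicy` shape, the `jobsPerTask` target, and the bounds are assumptions, not values from this design, and in practice the decision would be made by the ECS auto-scaling configuration rather than application code.

```typescript
// Hypothetical scaling policy for one worker task family (all values are assumptions).
interface ScalingPolicy {
  minTasks: number;    // floor: always keep at least this many tasks running
  maxTasks: number;    // ceiling: caps cost during a backlog spike
  jobsPerTask: number; // backlog one task is expected to absorb
}

// Returns the task count that keeps the backlog per task at or below
// jobsPerTask, clamped to the policy bounds.
function desiredTaskCount(queueDepth: number, policy: ScalingPolicy): number {
  if (policy.jobsPerTask <= 0) {
    throw new Error("jobsPerTask must be positive");
  }
  const needed = Math.ceil(Math.max(queueDepth, 0) / policy.jobsPerTask);
  return Math.min(policy.maxTasks, Math.max(policy.minTasks, needed));
}

// Example policy for scanner-workers (illustrative numbers).
const scannerPolicy: ScalingPolicy = { minTasks: 2, maxTasks: 20, jobsPerTask: 100 };
const tasksForBacklog = desiredTaskCount(950, scannerPolicy);
```

Keeping a separate policy per task family mirrors the point above that each family scales independently.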

Network / Edge detail

Cloudflare (single edge layer):

  • WAF — OWASP Core Ruleset + custom API rules
  • DDoS L3/4/7 mitigation
  • Rate limiting per-IP, per-API-key
  • Bot management
  • TLS 1.2+ Full Strict
  • Edge caching at 300+ PoPs
  • Proxied DNS — origin IP never exposed

ALB:

  • Security group: inbound restricted to Cloudflare IP ranges
  • mTLS Authenticated Origin Pulls — cryptographic verification of authentic Cloudflare traffic
  • HTTPS termination, path-based routing
  • Health-checks Fargate tasks across AZs

AWS Shield Standard — enabled by default, free, covers residual L3/L4 DDoS at the AWS edge.

Intentionally NOT used:

  • AWS WAF — redundant given Cloudflare + IP allowlist + mTLS (see Tech Stack Item 9)
  • AWS Shield Advanced — roughly $3k/month and up; unnecessary because Cloudflare absorbs DDoS at the edge
  • CloudFront — Cloudflare handles all CDN duties

Data layer

  • PostgreSQL 16 Multi-AZ — primary + standby across AZs (automatic failover)
  • 1 Read Replica — analytics + heavy dashboard queries (offloads OLTP)
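  • RDS Proxy — connection pooling. Because pooled connections are reused across requests, schema-per-tenant access depends on setting search_path explicitly for every request rather than inheriting it from a previous session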
  • Automated backups — 35-day PITR window in production
  • Cross-region encrypted backups for catastrophic recovery
  • KMS encryption at rest with Customer-Managed Keys (CMKs); opens BYOK premium-tier path
  • ElastiCache Redis Multi-AZ — sessions, rate-limit state, query cache, BullMQ queues
  • TimescaleDB extension on Postgres for BCP time-series data (uptime, SLA, dependency-graph metrics) — no separate time-series tier needed
  • pgvector extension available for Phase 2 AI Assistant RAG embeddings
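
One way to keep the search_path discipline with pooled connections is to set the path inside each transaction with `SET LOCAL`, which PostgreSQL resets at commit or rollback. The sketch below is a minimal illustration under assumptions: the `SqlClient` interface stands in for whatever database client the BFF uses, and the `tenant_<id>` schema naming and function names are hypothetical.

```typescript
// Minimal stand-in for a database client (assumption; a real driver exposes more).
interface SqlClient {
  query(sql: string): Promise<unknown>;
}

// Hypothetical naming convention: one schema per tenant, "tenant_<id>".
// The strict pattern matters because the name is interpolated into SQL.
function tenantSchema(tenantId: string): string {
  if (!/^[a-z0-9_]{1,48}$/.test(tenantId)) {
    throw new Error("invalid tenant id");
  }
  return `tenant_${tenantId}`;
}

// Runs `work` in a transaction whose search_path points at the tenant schema.
async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  work: (c: SqlClient) => Promise<T>,
): Promise<T> {
  const schema = tenantSchema(tenantId);
  await client.query("BEGIN");
  try {
    // SET LOCAL lasts only until the transaction ends, so the setting
    // cannot leak to the next request that reuses this connection.
    await client.query(`SET LOCAL search_path TO "${schema}", public`);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

A caller would wrap each request's queries, for example `withTenant(client, "acme", (c) => c.query("SELECT count(*) FROM evidence"))`, where the `evidence` table name is again only illustrative.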

Async / queue layer

  • BullMQ on Redis — primary background-job queue. Richer features than SQS (delayed jobs, priorities, repeating jobs, per-queue rate limiting), and Redis is already in the stack.
  • Dead-letter queue for failed jobs — audit trail of what didn't process
  • Migration to SQS remains a viable path if BullMQ ops ever become problematic — both are well-understood patterns.
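
The retry-then-dead-letter behaviour described above can be modelled as a small decision function. This sketch does not use BullMQ's API; the `RetryPolicy` shape, the exponential backoff formula, and the names are assumptions chosen to illustrate the idea.

```typescript
// Hypothetical retry policy for a queue (values are assumptions).
interface RetryPolicy {
  maxAttempts: number; // total attempts before the job is given up on
  baseDelayMs: number; // delay before the first retry
}

type NextStep =
  | { kind: "retry"; delayMs: number }
  | { kind: "dead-letter" };

// Decides what happens after a failed attempt.
// `attemptsMade` counts attempts so far, including the one that just failed.
function nextStep(attemptsMade: number, policy: RetryPolicy): NextStep {
  if (attemptsMade >= policy.maxAttempts) {
    // Out of attempts: park the job so there is a record of what did not process.
    return { kind: "dead-letter" };
  }
  // Exponential backoff: base, 2x base, 4x base, ...
  const delayMs = policy.baseDelayMs * 2 ** (attemptsMade - 1);
  return { kind: "retry", delayMs };
}

const reportPolicy: RetryPolicy = { maxAttempts: 3, baseDelayMs: 1000 };
const afterFirstFailure = nextStep(1, reportPolicy);
```

Keeping the decision separate from the queue library would also make a later move to SQS less invasive, since only the code that enqueues and parks jobs would change.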

Object storage (S3)

  • Evidence bucket — customer-uploaded evidence artifacts. KMS-encrypted with CMK. Versioning + Object Lock for tamper-evidence (regulatory requirement).
  • Documents bucket — generated reports, exports, customer downloads.
  • Backup bucket — RDS exports + CloudTrail archive. Object Lock for 7-year retention.
  • Lifecycle policies — old evidence to Glacier Deep Archive after retention threshold.
  • All buckets accessed via VPC Endpoints — no public-internet egress for AWS-service traffic.

Secrets, keys & config

  • AWS Secrets Manager — DB credentials, JWT signing keys, external API tokens. Auto-rotation, audited.
  • AWS KMS — encryption keys with Customer-Managed Keys (CMKs). Per-environment keys never cross account boundaries. BYOK path for premium tier.
  • AWS SSM Parameter Store — non-secret config.

Security & compliance services

  • AWS GuardDuty — anomaly detection at AWS layer (compromised credentials, anomalous API calls, crypto-mining). Acts as "EDR for AWS infra"; findings feed Logz.io as primary SIEM.
  • AWS Config — resource configuration tracking + drift detection. SOC 2 evidence.
  • CloudTrail — multi-region trail to Object-Locked S3 bucket. Immutable record of every AWS API call.
  • VPC Flow Logs — audit trail of network flow metadata (source, destination, ports, accept/reject) → CloudWatch → Logz.io.
  • AWS Backup — orchestrates cross-service backup (RDS, S3, EBS).
  • AWS Security Hub — NOT used; Logz.io is the primary alert pane.
  • Automated cross-tenant isolation tests run in CI on every deploy to verify that schema-per-tenant isolation holds.

Identity & access (internal staff)

  • AWS IAM for human access (not Identity Center / SSO) — IAM users in role-based groups, MFA mandatory, least-privilege policies, no long-lived keys where avoidable (prefer assume-role + temporary creds), root locked down. See the Environments section for detail and the SSO revisit-later note.
  • Break-glass procedure for emergency access — audited via CloudTrail
  • No direct database access — bastion via AWS Session Manager only, fully audited
  • IAM roles for ECS tasks — no long-lived credentials in containers (short-lived credentials from the task role, not stored keys)
  • Least privilege per service; quarterly access reviews

Email & notifications

  • AWS SES — transactional email (alerts, scheduled reports, password resets)
  • AWS SNS — internal alerting fanout (email / SMS / Slack via webhook)
  • WorkOS sends its own auth-related email (verification, MFA codes) via AuthKit

Infrastructure as Code

  • Terraform (HCL) — all AWS infrastructure as code
  • Modular: one module per logical layer (vpc, ecs-fargate, rds-postgres, elasticache, s3, logging, observability, iam, secrets)
  • Per-environment tfvars scale resources without code changes
  • Terraform state in shared-services S3 with DynamoDB locking
  • GitHub Actions applies via OIDC + cross-account deploy role (no long-lived AWS keys in CI)

Container registry

  • Amazon ECR in shared-services account — single registry; every environment pulls the same images
  • Image scanning on push (Snyk or ECR-native)
  • Lifecycle policy: retain last 50 images, expire older
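
The retention rule can be expressed as a simple selection over images sorted by push time. The sketch below only illustrates the rule; the `ImageRecord` shape and function name are hypothetical, and the real policy would be declared on the ECR repository rather than run as code.

```typescript
// Hypothetical record for one image in the registry.
interface ImageRecord {
  tag: string;
  pushedAt: Date;
}

// Returns the images that fall outside the newest `keep` images.
function imagesToExpire(images: readonly ImageRecord[], keep: number = 50): ImageRecord[] {
  const newestFirst = images
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.pushedAt.getTime() - a.pushedAt.getTime());
  return newestFirst.slice(keep);
}
```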

Disaster recovery

| Failure mode | Target | Mechanism |
| --- | --- | --- |
| Single task / Fargate failure | RTO in seconds | ALB removes from rotation; auto-replacement |
| Single AZ failure | RTO < 1 hour | Multi-AZ failover on RDS, ALB, Redis; Fargate cross-AZ spread |
| Data corruption | RPO < 15 min | RDS automated backups with transaction logs (WAL) archived about every 5 minutes; PITR |
| Region failure | RTO < 24 hours | Cross-region encrypted backups + Terraform redeploy runbook |

Annual DR drill — tested + documented (SOC 2 mandate). Weekly backup verification — automated restore tests.

Phase 2 — Israel region (AWS Tel Aviv, il-central-1)

Same architecture, deployed as a separate AWS account (set-prod-il) when the first Nimbus-aligned Israeli customer (government / defense / regulated healthcare / banking) signs:

  • Same Terraform modules, IL-specific tfvars
  • Self-hosted Keycloak instance in IL for Israeli data residency
  • Per-tenant routing in NestJS auth middleware (WorkOS for non-Nimbus, Keycloak for Nimbus)
  • Logz.io destination TBD at that time (Israel-region option, or self-hosted ELK)
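
The per-tenant routing in the auth middleware could reduce to a lookup like the one below. This is a sketch under assumptions: the `nimbusAligned` flag, the `TenantAuthConfig` shape, and the function names are hypothetical, and a real middleware would also verify the token itself against the chosen provider.

```typescript
type IdentityProvider = "workos" | "keycloak";

// Hypothetical per-tenant auth settings.
interface TenantAuthConfig {
  tenantId: string;
  nimbusAligned: boolean; // true for tenants that require Israeli data residency
}

// Nimbus-aligned tenants authenticate against Keycloak in IL; all others use WorkOS.
function resolveIdentityProvider(tenant: TenantAuthConfig): IdentityProvider {
  return tenant.nimbusAligned ? "keycloak" : "workos";
}

// Rejects a token minted by the other provider, so a WorkOS session
// cannot be replayed against a Keycloak-only tenant (and vice versa).
function isTokenAcceptable(tenant: TenantAuthConfig, tokenProvider: IdentityProvider): boolean {
  return resolveIdentityProvider(tenant) === tokenProvider;
}
```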

API Strategy (Dual Protocol)

| Layer | Protocol | Why |
| --- | --- | --- |
| Frontend ↔ BFF | GraphQL (@nestjs/graphql) | Dashboard aggregation (1 query per view), cross-framework graph data, end-to-end type safety via codegen |
| Public/Partner API | REST (@nestjs/swagger / OpenAPI) | Broad compatibility, simple rate limiting, clean audit trail per endpoint |
| Both served from same NestJS server | Shared auth + tenant middleware | Single deployment, consistent security layer |

GraphQL (Internal — Frontend Only):

  • Powered by Apollo Server via @nestjs/graphql
  • Schema-first approach with GraphQL Code Generator → auto-generated TypeScript types + React hooks
  • Relay-style cursor pagination (max 100 per page) to prevent DoS
  • DataLoaders for batching/deduplication (prevent N+1 queries)
  • Query depth/complexity limiting to prevent abuse
  • Compliance dashboards, cross-framework mapping, and evidence graphs all benefit from single-query aggregation
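
The pagination cap can be illustrated with a minimal Relay-style connection builder over an in-memory list. Everything here is a sketch under assumptions: the cursor encoding, the `paginate` helper, and the requirement that rows carry an `id` are hypothetical, and a real resolver would page in the database rather than in memory.

```typescript
const MAX_PAGE_SIZE = 100; // cap from the pagination rule above

interface Edge<T> {
  cursor: string;
  node: T;
}

interface Connection<T> {
  edges: Edge<T>[];
  pageInfo: { hasNextPage: boolean; endCursor: string | null };
}

// Opaque cursor: here simply the encoded row id (assumption).
function encodeCursor(id: string): string {
  return encodeURIComponent(`id:${id}`);
}

function decodeCursor(cursor: string): string {
  const raw = decodeURIComponent(cursor);
  if (raw.indexOf("id:") !== 0) throw new Error("invalid cursor");
  return raw.slice(3);
}

// Returns up to `first` rows after the `after` cursor, never more than MAX_PAGE_SIZE.
function paginate<T extends { id: string }>(
  rows: readonly T[],
  first: number,
  after?: string,
): Connection<T> {
  const requested = Number.isFinite(first) ? Math.floor(first) : 1;
  const limit = Math.min(Math.max(requested, 1), MAX_PAGE_SIZE);
  let start = 0;
  if (after !== undefined) {
    const afterId = decodeCursor(after);
    const index = rows.findIndex((row) => row.id === afterId);
    if (index === -1) throw new Error("cursor does not match any row");
    start = index + 1;
  }
  const edges = rows
    .slice(start, start + limit)
    .map((node) => ({ cursor: encodeCursor(node.id), node }));
  const last = edges[edges.length - 1];
  return {
    edges,
    pageInfo: {
      hasNextPage: start + limit < rows.length,
      endCursor: last ? last.cursor : null,
    },
  };
}
```

Clamping on the server means a client asking for `first: 500` still receives at most 100 edges, which is what bounds the cost of a single query.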

REST (External — Public API):

  • OpenAPI 3.0 spec with Swagger UI documentation
  • API key authentication (scoped per tenant, per-endpoint read/write permissions)
  • Rate limiting (per-key, per-IP)
  • Versioned endpoints (v1, v2)
  • Webhook support for real-time event notifications
  • Every mutation logged as an audit event
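
A tenant-scoped API key with per-endpoint read/write permissions could be checked as sketched below. The `ApiKey` shape, the prefix-based scope map, and the rule that safe HTTP methods need "read" while all others need "write" are assumptions made for illustration, not the documented key model.

```typescript
type Access = "read" | "write";

// Hypothetical API key record: scopes map an endpoint prefix to granted access.
interface ApiKey {
  tenantId: string;
  scopes: { [endpointPrefix: string]: Access[] };
}

// Safe methods only need read access; anything that can mutate needs write.
function requiredAccess(method: string): Access {
  const safeMethods = ["GET", "HEAD", "OPTIONS"];
  return safeMethods.indexOf(method.toUpperCase()) !== -1 ? "read" : "write";
}

// True when the key belongs to the tenant and grants the needed access on the path.
function isAuthorized(key: ApiKey, tenantId: string, method: string, path: string): boolean {
  if (key.tenantId !== tenantId) return false;
  const needed = requiredAccess(method);
  return Object.keys(key.scopes).some((prefix) => {
    // Match the prefix only on a path-segment boundary.
    const matches = path === prefix || path.indexOf(prefix + "/") === 0;
    return matches && key.scopes[prefix].indexOf(needed) !== -1;
  });
}

const partnerKey: ApiKey = {
  tenantId: "t1",
  scopes: { "/v1/evidence": ["read"], "/v1/webhooks": ["read", "write"] },
};
```

With this key, a `GET /v1/evidence/123` for tenant `t1` passes while a `POST /v1/evidence` is refused; each refusal or accepted mutation is a natural point to emit the audit event mentioned above.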