Architecture

This section describes the production architecture (one region). Environment topology (dev / nonprod / production / Phase 2 IL) is covered separately — see the "Environments" section.

Pattern: Backend-for-Frontend (BFF)

Frontend — React SPA, pure UI shell. No secrets, no external API calls, no business logic.
Backend — NestJS BFF. Single gateway for all requests: auth, data fetching, validation, business rules, integration with external services.
Data flow: React SPA → Cloudflare → AWS ALB → NestJS BFF → PostgreSQL / WorkOS / S3 / external services

Production region architecture (one AWS region)

                              Internet
                                  │
                                  ▼
              ┌──────────────────────────────────────────┐
              │  Cloudflare (single edge layer)          │
              │  • WAF (OWASP Core Ruleset + custom)     │
              │  • DDoS L3/4/7 mitigation                │
              │  • Rate limit (per-IP, per-API-key)      │
              │  • Bot management                        │
              │  • CDN (300+ PoPs, static SPA assets)    │
              │  • Proxied DNS (origin IP hidden)        │
              │  • TLS 1.2+ Full Strict                  │
              └────────────────┬─────────────────────────┘
                               │  (only Cloudflare IPs +
                               │   mTLS Authenticated Origin Pulls)
                               ▼
              ┌──────────────────────────────────────────┐
              │  AWS Production Account (set-prod-eu)    │
              │  ┌──────────────────────────────────┐    │
              │  │ VPC (/16, 3 Availability Zones)   │    │
              │  │                                   │    │
              │  │ Public subnets (1 per AZ)         │    │
              │  │  └─ ALB                            │    │
              │  │      (SG: inbound = CF IPs only)   │    │
              │  │                                   │    │
              │  │ Private subnets — compute         │    │
              │  │  ├─ Fargate: api (NestJS BFF)    │    │
              │  │  ├─ Fargate: api-workers          │    │
              │  │  ├─ Fargate: scanner-workers      │    │
              │  │  └─ Fargate: ai-workers (Phase 2) │    │
              │  │                                   │    │
              │  │ Private subnets — data            │    │
              │  │  ├─ RDS Postgres 16 Multi-AZ      │    │
              │  │  │   + 1 Read Replica             │    │
              │  │  │   + RDS Proxy (pooling)        │    │
              │  │  ├─ ElastiCache Redis Multi-AZ    │    │
              │  │  │   (sessions, BullMQ queues)    │    │
              │  │  └─ Keycloak (Phase 2, AWS IL)    │    │
              │  │                                   │    │
              │  │ NAT Gateway per AZ                │    │
              │  │ VPC Endpoints (S3, ECR, Secrets   │    │
              │  │   Manager, KMS, CloudWatch Logs)  │    │
              │  └──────────────────────────────────┘    │
              │                                            │
              │  Managed / shared services                 │
              │  ├─ S3 (KMS-encrypted, all buckets)        │
              │  │   ├─ Evidence (versioning + Object Lock)│
              │  │   ├─ Documents / reports                │
              │  │   └─ Backups + CloudTrail archive       │
              │  ├─ Secrets Manager (DB, JWT, API keys)    │
              │  ├─ KMS (Customer-Managed Keys)            │
              │  ├─ SSM Parameter Store (non-secret cfg)   │
              │  ├─ SES (transactional email)              │
              │  ├─ SNS (internal alerting fanout)         │
              │  ├─ CloudTrail multi-region trail          │
              │  │   → S3 with Object Lock                 │
              │  ├─ AWS Config (resource compliance)       │
              │  ├─ GuardDuty (threat detection)           │
              │  ├─ VPC Flow Logs                          │
              │  ├─ AWS Backup (cross-service orchestrate) │
              │  ├─ IAM (role-based groups + MFA)          │
              │  └─ CloudWatch (transit; see Observability)│
              └──────────────────────────────────────────┘
                               │
                               ▼
              ┌──────────────────────────────────────────┐
              │  External (managed sub-processors)        │
              │  ├─ WorkOS (auth, EU region)              │
              │  ├─ Logz.io (observability + SIEM, EU)    │
              │  └─ Customer-connected systems            │
              │     (Okta, AWS, GCP, Azure, GitHub, etc.) │
              └──────────────────────────────────────────┘

Frontend deployment

React SPA built as static files → uploaded to private S3 bucket
Served via app.{domain} through Cloudflare → S3 (bucket policy restricts to Cloudflare)
Cloudflare CDN caches static assets at 300+ PoPs globally, so they are served from an edge close to the user and first paint stays fast
Cache invalidation via Cloudflare API on deploy
No CloudFront — Cloudflare handles all CDN duties

Backend deployment

NestJS BFF + workers run on ECS Fargate. Served via api.{domain} through Cloudflare → ALB → Fargate tasks.

Three Fargate task families (four with Phase 2), each scaling independently:

Task family	Holds	Scales on
`api`	NestJS BFF	request count + p99 latency
`api-workers`	Notifications, reports, exports, audit log persistence, webhooks, tenant lifecycle, maintenance	BullMQ queue depth
`scanner-workers`	Security scans, evidence ingestion, agent ingestion, compliance computation (rule engine)	BullMQ queue depth (tasks use a higher CPU/memory profile)
`ai-workers` (Phase 2)	AI Assistant, embedding generation, RAG over evidence	Queue depth

Deployment: rolling updates with circuit breaker + auto-rollback on health-check failure. No long-lived credentials in containers — IAM roles per task family.
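
The queue-depth scaling listed for the worker families can be pictured as a simple target calculation. The sketch below is illustrative only: the `QueueScalingPolicy` shape, the `desiredTaskCount` helper, and every number in `scannerPolicy` are assumptions, not values from this design. In practice the equivalent rule would live in the ECS auto-scaling configuration rather than in application code.

```typescript
// Hypothetical scaling policy for one task family; all numbers are illustrative.
interface QueueScalingPolicy {
  jobsPerTask: number; // backlog a single task is expected to absorb
  minTasks: number;    // floor kept running even when the queue is empty
  maxTasks: number;    // ceiling that caps cost and downstream load
}

// Enough tasks to cover the current backlog, clamped to the floor and ceiling.
function desiredTaskCount(queueDepth: number, policy: QueueScalingPolicy): number {
  if (queueDepth < 0 || policy.jobsPerTask <= 0) {
    throw new Error("queueDepth must be >= 0 and jobsPerTask must be > 0");
  }
  const needed = Math.ceil(queueDepth / policy.jobsPerTask);
  return Math.min(policy.maxTasks, Math.max(policy.minTasks, needed));
}

// Assumed example values for the scanner-workers family.
const scannerPolicy: QueueScalingPolicy = { jobsPerTask: 25, minTasks: 2, maxTasks: 20 };
```

With these assumed values, an empty queue keeps the floor of 2 tasks, a backlog of 130 jobs asks for 6 tasks, and a very deep queue is capped at 20.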

Network / Edge detail

Cloudflare (single edge layer):

WAF — OWASP Core Ruleset + custom API rules
DDoS L3/4/7 mitigation
Rate limiting per-IP, per-API-key
Bot management
TLS 1.2+, SSL/TLS mode Full (strict)
Edge caching at 300+ PoPs
Proxied DNS — origin IP never exposed

ALB:

Security group: inbound restricted to Cloudflare IP ranges
mTLS Authenticated Origin Pulls — cryptographic verification of authentic Cloudflare traffic
HTTPS termination, path-based routing
Health-checks Fargate tasks across AZs
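
The "Cloudflare IP ranges only" rule is enforced by the ALB security group, not by application code. As a rough illustration of what that allowlist evaluates, here is a minimal IPv4 CIDR membership check. The helper names are hypothetical, and the ranges in `ALLOWED_RANGES` are documentation placeholders (RFC 5737), not Cloudflare's real ranges; the live list would come from Cloudflare's published IP ranges.

```typescript
// Convert a dotted-quad IPv4 address to an unsigned integer; null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside the `cidr` block (for example "192.0.2.0/24").
function inCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  if (slash < 0) return false;
  const prefixText = cidr.slice(slash + 1);
  if (!/^\d{1,2}$/.test(prefixText)) return false;
  const prefix = Number(prefixText);
  if (prefix > 32) return false;
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(cidr.slice(0, slash));
  if (ipInt === null || baseInt === null) return false;
  // Two addresses share a block when they agree after dropping the host bits.
  const blockSize = Math.pow(2, 32 - prefix);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// Placeholder ranges only (RFC 5737 documentation blocks), NOT real Cloudflare ranges.
const ALLOWED_RANGES: string[] = ["192.0.2.0/24", "198.51.100.0/24"];

function isAllowedOrigin(ip: string): boolean {
  for (const range of ALLOWED_RANGES) {
    if (inCidr(ip, range)) return true;
  }
  return false;
}
```

An IP allowlist alone only proves that traffic came from Cloudflare's network, which is why the design pairs it with mTLS Authenticated Origin Pulls.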

AWS Shield Standard — enabled by default, free, covers residual L3/L4 DDoS at the AWS edge.

Intentionally NOT used:

AWS WAF — redundant given Cloudflare + IP allowlist + mTLS (see Tech Stack Item 9)
AWS Shield Advanced — unnecessary at $3k/mo+; Cloudflare absorbs DDoS at the edge
CloudFront — Cloudflare handles all CDN duties

Data layer

PostgreSQL 16 Multi-AZ — primary + standby across AZs (automatic failover)
1 Read Replica — analytics + heavy dashboard queries (offloads OLTP)
RDS Proxy — connection pooling, critical for schema-per-tenant search_path swap discipline
Automated backups — 35-day PITR window in production
Cross-region encrypted backups for catastrophic recovery
KMS encryption at rest with Customer-Managed Keys (CMKs); opens BYOK premium-tier path
ElastiCache Redis Multi-AZ — sessions, rate-limit state, query cache, BullMQ queues
TimescaleDB extension on Postgres for BCP time-series data (uptime, SLA, dependency-graph metrics) — no separate time-series tier needed
pgvector extension available for Phase 2 AI Assistant RAG embeddings
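
Because pooled connections are reused across requests, the `search_path` switch for schema-per-tenant has to be both injection-safe and scoped to a single transaction. The following is a minimal sketch of that discipline. The `tenant_` schema prefix, the tenant id pattern, and both function names are assumptions made for illustration; the source does not specify a naming convention.

```typescript
// Assumed convention (not from the source): schema name = "tenant_" + tenant id,
// where the id contains only lowercase letters, digits, and underscores.
const TENANT_ID_PATTERN = /^[a-z0-9_]{1,48}$/;

// Validate before building SQL: identifiers cannot be passed as bind parameters,
// so the tenant id must never reach the statement unchecked.
function tenantSchema(tenantId: string): string {
  if (!TENANT_ID_PATTERN.test(tenantId)) {
    throw new Error("invalid tenant id");
  }
  return "tenant_" + tenantId;
}

// SET LOCAL lasts only until the enclosing transaction ends, so a pooled
// connection does not carry one tenant's search_path into the next request.
function searchPathStatement(tenantId: string): string {
  return 'SET LOCAL search_path TO "' + tenantSchema(tenantId) + '"';
}
```

The statement is meant to be issued as the first command inside each request's transaction, before any tenant query runs.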

Async / queue layer

BullMQ on Redis — primary background-job queue. Richer features than SQS (delayed jobs, priorities, repeating jobs, per-queue rate limiting), and Redis is already in the stack.
Dead-letter queue for failed jobs — audit trail of what didn't process
Migration to SQS remains a viable path if operating BullMQ ever becomes a burden — both are well-understood patterns.
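
The retry-then-dead-letter flow described above can be modeled without any queue library. This sketch is a self-contained model, not BullMQ's API: the `FailedJob` and `NextStep` types, the `nextStep` function, and the doubling backoff schedule are assumptions chosen to illustrate the decision a worker makes after a failure.

```typescript
interface FailedJob {
  id: string;
  attemptsMade: number; // attempts already executed, including the one that just failed
  maxAttempts: number;  // total attempts allowed before giving up
}

type NextStep =
  | { kind: "retry"; delayMs: number }
  | { kind: "dead-letter" };

// Decide what happens to a job right after a failed attempt.
function nextStep(job: FailedJob, baseDelayMs: number): NextStep {
  if (job.attemptsMade >= job.maxAttempts) {
    // Out of attempts: park the job so there is a record of what never processed.
    return { kind: "dead-letter" };
  }
  // Exponential backoff: base, 2x base, 4x base, ...
  return { kind: "retry", delayMs: baseDelayMs * Math.pow(2, job.attemptsMade - 1) };
}
```

With a 1000 ms base delay and three allowed attempts, the first failure retries after 1000 ms, the second after 2000 ms, and the third failure sends the job to the dead-letter queue.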

Object storage (S3)

Evidence bucket — customer-uploaded evidence artifacts. KMS-encrypted with CMK. Versioning + Object Lock for tamper-evidence (regulatory requirement).
Documents bucket — generated reports, exports, customer downloads.
Backup bucket — RDS exports + CloudTrail archive. Object Lock for 7-year retention.
Lifecycle policies — old evidence to Glacier Deep Archive after retention threshold.
From inside the VPC, all buckets are accessed via VPC Endpoints — no public-internet egress for AWS-service traffic.

Secrets, keys & config

AWS Secrets Manager — DB credentials, JWT signing keys, external API tokens. Auto-rotation, audited.
AWS KMS — encryption keys with Customer-Managed Keys (CMKs). Per-environment keys never cross account boundaries. BYOK path for premium tier.
AWS SSM Parameter Store — non-secret config.

Security & compliance services

AWS GuardDuty — anomaly detection at AWS layer (compromised credentials, anomalous API calls, crypto-mining). Acts as "EDR for AWS infra"; findings feed Logz.io as primary SIEM.
AWS Config — resource configuration tracking + drift detection. SOC 2 evidence.
CloudTrail — multi-region trail to Object-Locked S3 bucket. Immutable record of every AWS API call.
VPC Flow Logs — full network audit trail → CloudWatch → Logz.io.
AWS Backup — orchestrates cross-service backup (RDS, S3, EBS).
AWS Security Hub — NOT used; Logz.io is the primary alert pane.
Automated cross-tenant isolation tests run in CI on every deploy — they verify that schema-per-tenant isolation holds.

Identity & access (internal staff)

AWS IAM for human access (not Identity Center / SSO) — IAM users in role-based groups, MFA mandatory, least-privilege policies, no long-lived keys where avoidable (prefer assume-role + temporary creds), root locked down. See the Environments section for detail and the SSO revisit-later note.
Break-glass procedure for emergency access — audited via CloudTrail
No direct database access — bastion via AWS Session Manager only, fully audited
IAM roles for ECS tasks — no long-lived credentials in containers (temporary credentials issued through the task role, not stored keys)
Least privilege per service; quarterly access reviews

Email & notifications

AWS SES — transactional email (alerts, scheduled reports, password resets)
AWS SNS — internal alerting fanout (email / SMS / Slack via webhook)
WorkOS sends its own auth-related email (verification, MFA codes) via AuthKit

Infrastructure as Code

Terraform (HCL) — all AWS infrastructure as code
Modular: one module per logical layer (vpc, ecs-fargate, rds-postgres, elasticache, s3, logging, observability, iam, secrets)
Per-environment tfvars scale resources without code changes
Terraform state in shared-services S3 with DynamoDB locking
GitHub Actions applies via OIDC + cross-account deploy role (no long-lived AWS keys in CI)

Container registry

Amazon ECR in shared-services account — single registry; every environment pulls the same images
Image scanning on push (Snyk or ECR-native)
Lifecycle policy: retain last 50 images, expire older

Disaster recovery

Failure mode	Target	Mechanism
Single task / Fargate failure	RTO seconds	ALB removes from rotation; auto-replacement
Single AZ failure	RTO < 1 hour	Multi-AZ failover on RDS, ALB, Redis; Fargate cross-AZ spread
Data corruption	RPO < 15 min	RDS automated backups + transaction logs shipped every 5 min; PITR
Region failure	RTO < 24 hours	Cross-region encrypted backups + Terraform redeploy runbook

Annual DR drill — tested + documented (SOC 2 mandate). Weekly backup verification — automated restore tests.

Phase 2 — Israel region (AWS Tel Aviv, `il-central-1`)

Same architecture, deployed as a separate AWS account (set-prod-il) when the first Nimbus-aligned Israeli customer (government / defense / regulated healthcare / banking) signs:

Same Terraform modules, IL-specific tfvars
Self-hosted Keycloak instance in IL for Israeli data residency
Per-tenant routing in NestJS auth middleware (WorkOS for non-Nimbus, Keycloak for Nimbus)
Logz.io destination TBD at that time (Israel-region option, or self-hosted ELK)
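
The per-tenant routing between WorkOS and Keycloak comes down to a lookup made before any token is trusted. Below is a minimal sketch of that decision. The `nimbusAligned` flag, the `TenantRecord` shape, the function names, and both issuer URLs are hypothetical placeholders, not values from this design.

```typescript
type IdentityProvider = "workos" | "keycloak";

interface TenantRecord {
  id: string;
  nimbusAligned: boolean; // hypothetical flag marking tenants that must stay on IL infrastructure
}

// Placeholder issuer URLs; the real values depend on the WorkOS and Keycloak setup.
const EXPECTED_ISSUER: Record<IdentityProvider, string> = {
  workos: "https://workos.example.invalid",
  keycloak: "https://keycloak.il.example.invalid/realms/app",
};

// Nimbus tenants authenticate against Keycloak in IL; everyone else uses WorkOS.
function identityProviderFor(tenant: TenantRecord): IdentityProvider {
  return tenant.nimbusAligned ? "keycloak" : "workos";
}

// Reject tokens minted by the provider the tenant is NOT assigned to,
// so a valid WorkOS token can never open a Nimbus tenant (and vice versa).
function issuerMatchesTenant(tenant: TenantRecord, tokenIssuer: string): boolean {
  return EXPECTED_ISSUER[identityProviderFor(tenant)] === tokenIssuer;
}
```

Checking the issuer against the tenant's assigned provider, rather than accepting any known issuer, keeps the two identity systems from substituting for each other.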

API Strategy (Dual Protocol)

Layer	Protocol	Why
Frontend ↔ BFF	GraphQL (`@nestjs/graphql`)	Dashboard aggregation (1 query per view), cross-framework graph data, end-to-end type safety via codegen
Public/Partner API	REST (`@nestjs/swagger` / OpenAPI)	Broad compatibility, simple rate limiting, clean audit trail per endpoint
Both served from same NestJS server	Shared auth + tenant middleware	Single deployment, consistent security layer

GraphQL (Internal — Frontend Only):

Powered by Apollo Server via @nestjs/graphql
Schema-first approach with GraphQL Code Generator → auto-generated TypeScript types + React hooks
Relay-style cursor pagination (max 100 per page) to bound response size and blunt DoS attempts
DataLoaders for batching/deduplication (prevent N+1 queries)
Query depth/complexity limiting to prevent abuse
Compliance dashboards, cross-framework mapping, and evidence graphs all benefit from single-query aggregation
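
The page-size cap and depth limit above are simple guards that run before a query executes. The sketch below shows both on a simplified selection tree. Only the cap of 100 comes from this document; `DEFAULT_PAGE_SIZE`, `MAX_DEPTH`, the `SelectionNode` shape, and the choice to clamp rather than reject oversized pages are assumptions. A real implementation would walk the parsed GraphQL document instead.

```typescript
const MAX_PAGE_SIZE = 100;    // from the pagination rule above
const DEFAULT_PAGE_SIZE = 25; // assumed default when `first` is omitted
const MAX_DEPTH = 8;          // assumed depth limit

// Clamp the Relay `first` argument so one request cannot ask for an unbounded page.
function clampPageSize(first: number | undefined): number {
  if (first === undefined) return DEFAULT_PAGE_SIZE;
  if (first !== Math.floor(first) || first < 1) {
    throw new Error("first must be a positive integer");
  }
  return Math.min(first, MAX_PAGE_SIZE);
}

// Simplified stand-in for a parsed GraphQL selection set.
interface SelectionNode {
  name: string;
  children?: SelectionNode[];
}

// Depth of the deepest field path, counting the root field as 1.
function selectionDepth(node: SelectionNode): number {
  let deepestChild = 0;
  for (const child of node.children || []) {
    deepestChild = Math.max(deepestChild, selectionDepth(child));
  }
  return 1 + deepestChild;
}

function withinDepthLimit(root: SelectionNode): boolean {
  return selectionDepth(root) <= MAX_DEPTH;
}
```

Clamping keeps well-meaning clients working when they over-ask; rejecting with an error is an equally valid choice if stricter client behavior is preferred.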

REST (External — Public API):

OpenAPI 3.0 spec with Swagger UI documentation
API key authentication (scoped per tenant, per-endpoint read/write permissions)
Rate limiting (per-key, per-IP)
Versioned endpoints (v1, v2)
Webhook support for real-time event notifications
Every mutation logged as an audit event
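
Tenant-scoped API keys with per-endpoint read/write permissions can be checked with a small pure function. This is a sketch under assumptions: the `ApiKey` shape, the rule that GET and HEAD need `read` while every other method needs `write`, and the endpoint paths in `reportingKey` are all hypothetical.

```typescript
type Access = "read" | "write";

interface ApiKey {
  tenantId: string;
  // Endpoint path mapped to the access levels granted on it (hypothetical layout).
  scopes: { [endpoint: string]: Access[] };
}

// Assumed mapping: safe methods need "read", everything else needs "write".
function requiredAccess(method: string): Access {
  const upper = method.toUpperCase();
  return upper === "GET" || upper === "HEAD" ? "read" : "write";
}

function isAuthorized(key: ApiKey, tenantId: string, endpoint: string, method: string): boolean {
  // A key never works outside the tenant it was issued for.
  if (key.tenantId !== tenantId) return false;
  // Own-property check so inherited names like "constructor" are never treated as scopes.
  if (!Object.prototype.hasOwnProperty.call(key.scopes, endpoint)) return false;
  return key.scopes[endpoint].indexOf(requiredAccess(method)) !== -1;
}

// Example key: read-only on evidence, read/write on webhooks.
const reportingKey: ApiKey = {
  tenantId: "tenant-a",
  scopes: { "/v1/evidence": ["read"], "/v1/webhooks": ["read", "write"] },
};
```

Denying by default (unknown endpoint, wrong tenant, missing access level) means a misconfigured key fails closed.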
