Architecture
This section describes the production architecture (one region). Environment topology (dev / nonprod / production / Phase 2 IL) is covered separately — see the "Environments" section.
Pattern: Backend-for-Frontend (BFF)
- Frontend — React SPA, pure UI shell. No secrets, no external API calls, no business logic.
- Backend — NestJS BFF. Single gateway for all requests: auth, data fetching, validation, business rules, integration with external services.
- Data flow: React SPA → Cloudflare → AWS ALB → NestJS BFF → PostgreSQL / WorkOS / S3 / external services
Production region architecture (one AWS region)
Internet
│
▼
┌──────────────────────────────────────────┐
│ Cloudflare (single edge layer) │
│ • WAF (OWASP Core Ruleset + custom) │
│ • DDoS L3/4/7 mitigation │
│ • Rate limit (per-IP, per-API-key) │
│ • Bot management │
│ • CDN (300+ PoPs, static SPA assets) │
│ • Proxied DNS (origin IP hidden) │
│ • TLS 1.2+ Full Strict │
└────────────────┬─────────────────────────┘
│ (only Cloudflare IPs +
│ mTLS Authenticated Origin Pulls)
▼
┌──────────────────────────────────────────┐
│ AWS Production Account (set-prod-eu) │
│ ┌──────────────────────────────────┐ │
│ │ VPC (/16, 3 Availability Zones) │ │
│ │ │ │
│ │ Public subnets (1 per AZ) │ │
│ │ └─ ALB │ │
│ │ (SG: inbound = CF IPs only) │ │
│ │ │ │
│ │ Private subnets — compute │ │
│ │ ├─ Fargate: api (NestJS BFF) │ │
│ │ ├─ Fargate: api-workers │ │
│ │ ├─ Fargate: scanner-workers │ │
│ │ └─ Fargate: ai-workers (Phase 2) │ │
│ │ │ │
│ │ Private subnets — data │ │
│ │ ├─ RDS Postgres 16 Multi-AZ │ │
│ │ │ + 1 Read Replica │ │
│ │ │ + RDS Proxy (pooling) │ │
│ │ ├─ ElastiCache Redis Multi-AZ │ │
│ │ │ (sessions, BullMQ queues) │ │
│ │ └─ Keycloak (Phase 2, AWS IL) │ │
│ │ │ │
│ │ NAT Gateway per AZ │ │
│ │ VPC Endpoints (S3, ECR, Secrets │ │
│ │ Manager, KMS, CloudWatch Logs) │ │
│ └──────────────────────────────────┘ │
│ │
│ Managed / shared services │
│ ├─ S3 (KMS-encrypted, all buckets) │
│ │ ├─ Evidence (versioning + Object Lock)│
│ │ ├─ Documents / reports │
│ │ └─ Backups + CloudTrail archive │
│ ├─ Secrets Manager (DB, JWT, API keys) │
│ ├─ KMS (Customer-Managed Keys) │
│ ├─ SSM Parameter Store (non-secret cfg) │
│ ├─ SES (transactional email) │
│ ├─ SNS (internal alerting fanout) │
│ ├─ CloudTrail multi-region trail │
│ │ → S3 with Object Lock │
│ ├─ AWS Config (resource compliance) │
│ ├─ GuardDuty (threat detection) │
│ ├─ VPC Flow Logs │
│ ├─ AWS Backup (cross-service orchestrate) │
│ ├─ IAM (users + groups, MFA)             │
│ └─ CloudWatch (transit; see Observability)│
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ External (managed sub-processors) │
│ ├─ WorkOS (auth, EU region) │
│ ├─ Logz.io (observability + SIEM, EU) │
│ └─ Customer-connected systems │
│ (Okta, AWS, GCP, Azure, GitHub, etc.) │
└──────────────────────────────────────────┘
Frontend deployment
- React SPA built as static files → uploaded to private S3 bucket
- Served via app.{domain} through Cloudflare → S3 (bucket policy restricts to Cloudflare)
- Cloudflare CDN caches at 300+ PoPs globally — fast first paint for cached assets
- Cache invalidation via Cloudflare API on deploy
- No CloudFront — Cloudflare handles all CDN duties
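The deploy-time cache invalidation could look roughly like the sketch below. It only builds the request for Cloudflare's zone `purge_cache` endpoint as plain data, so it runs without network access. The zone ID, API token, and the `app.example.com` URL are placeholders, and purging only the entry HTML assumes that other assets carry content hashes in their filenames, which the source does not state.

```typescript
// Plain-data description of an HTTP request (nothing is sent here).
interface PurgeRequest {
  url: string;
  method: "POST";
  headers: { [name: string]: string };
  body: string;
}

// Build a purge-by-URL request for Cloudflare's zone purge_cache endpoint.
// zoneId and apiToken are placeholders supplied by the deploy pipeline.
function buildPurgeRequest(zoneId: string, apiToken: string, urls: string[]): PurgeRequest {
  if (urls.length === 0) {
    throw new Error("at least one URL is required");
  }
  return {
    url: `https://api.cloudflare.com/client/v4/zones/${zoneId}/purge_cache`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ files: urls }),
  };
}

// Hypothetical deploy step: purge only the SPA entry point.
const purge = buildPurgeRequest("ZONE_ID", "API_TOKEN", [
  "https://app.example.com/index.html",
]);
console.log(purge.url);
```

A deploy job would pass this object to its HTTP client of choice after the S3 upload completes.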
Backend deployment
NestJS BFF + workers run on ECS Fargate. Served via api.{domain} through Cloudflare → ALB → Fargate tasks.
Three Fargate task families (four with Phase 2), each of which scales independently:
| Task family | Holds | Scales on |
|---|---|---|
| api | NestJS BFF | request count + p99 latency |
| api-workers | Notifications, reports, exports, audit log persistence, webhooks, tenant lifecycle, maintenance | BullMQ queue depth |
| scanner-workers | Security scans, evidence ingestion, agent ingestion, compliance computation (rule engine) | Queue depth, higher CPU/memory profile |
| ai-workers (Phase 2) | AI Assistant, embedding generation, RAG over evidence | Queue depth |
Deployment: rolling updates with circuit breaker + auto-rollback on health-check failure. No long-lived credentials in containers — IAM roles per task family.
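As an illustration of scaling a worker family on queue depth, the sketch below derives a desired task count from the backlog. The sizing rule (one task per fixed number of waiting jobs) and every number in `scannerPolicy` are assumptions made for the example, not values from this design.

```typescript
// Hypothetical sizing rule: one task per `jobsPerTask` waiting jobs,
// clamped to the task family's min/max task count.
interface ScalingPolicy {
  minTasks: number;
  maxTasks: number;
  jobsPerTask: number; // backlog one task is expected to absorb
}

function desiredTaskCount(queueDepth: number, policy: ScalingPolicy): number {
  if (policy.jobsPerTask <= 0) {
    throw new Error("jobsPerTask must be positive");
  }
  const needed = Math.ceil(Math.max(queueDepth, 0) / policy.jobsPerTask);
  return Math.min(policy.maxTasks, Math.max(policy.minTasks, needed));
}

// Illustrative values only.
const scannerPolicy: ScalingPolicy = { minTasks: 2, maxTasks: 20, jobsPerTask: 25 };
console.log(desiredTaskCount(0, scannerPolicy));    // 2
console.log(desiredTaskCount(130, scannerPolicy));  // 6
console.log(desiredTaskCount(5000, scannerPolicy)); // 20
```

In practice the queue depth would be published as a metric and the clamp would live in the autoscaling policy rather than in application code.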
Network / Edge detail
Cloudflare (single edge layer):
- WAF — OWASP Core Ruleset + custom API rules
- DDoS L3/4/7 mitigation
- Rate limiting per-IP, per-API-key
- Bot management
- TLS 1.2+ Full Strict
- Edge caching at 300+ PoPs
- Proxied DNS — origin IP never exposed
ALB:
- Security group: inbound restricted to Cloudflare IP ranges
- mTLS Authenticated Origin Pulls — cryptographic verification of authentic Cloudflare traffic
- HTTPS termination, path-based routing
- Health-checks Fargate tasks across AZs
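The IP allowlist on the ALB security group amounts to a CIDR membership test. A minimal IPv4-only sketch follows. The two ranges are a small sample of Cloudflare's published IPv4 ranges used for illustration; a real allowlist would be generated from the full published list and kept in sync.

```typescript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".");
  if (parts.length !== 4) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part) || Number(part) > 255) {
      throw new Error(`invalid IPv4 address: ${ip}`);
    }
    value = value * 256 + Number(part);
  }
  return value;
}

// True when `ip` falls inside the CIDR block (IPv4 only).
function isInCidr(ip: string, cidr: string): boolean {
  const [base, prefixText] = cidr.split("/");
  const prefix = Number(prefixText);
  if (!(prefix >= 0 && prefix <= 32) || Math.floor(prefix) !== prefix) {
    throw new Error(`invalid CIDR: ${cidr}`);
  }
  const size = 2 ** (32 - prefix);
  const start = Math.floor(ipv4ToInt(base) / size) * size;
  const addr = ipv4ToInt(ip);
  return addr >= start && addr < start + size;
}

// Sample only; not the complete Cloudflare range list.
const SAMPLE_CLOUDFLARE_RANGES = ["173.245.48.0/20", "104.16.0.0/13"];

function isFromSampleRanges(ip: string): boolean {
  return SAMPLE_CLOUDFLARE_RANGES.some((cidr) => isInCidr(ip, cidr));
}

console.log(isFromSampleRanges("173.245.52.10")); // true
console.log(isFromSampleRanges("8.8.8.8"));       // false
```

The security group enforces this at the network layer; the mTLS Authenticated Origin Pulls check then covers the case where another Cloudflare customer's traffic arrives from the same ranges.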
AWS Shield Standard — enabled by default, free, covers residual L3/L4 DDoS at the AWS edge.
Intentionally NOT used:
- AWS WAF — redundant given Cloudflare + IP allowlist + mTLS (see Tech Stack Item 9)
- AWS Shield Advanced — unnecessary at $3k/mo+; Cloudflare absorbs DDoS at the edge
- CloudFront — Cloudflare handles all CDN duties
Data layer
- PostgreSQL 16 Multi-AZ — primary + standby across AZs (automatic failover)
- 1 Read Replica — analytics + heavy dashboard queries (offloads OLTP)
- RDS Proxy — connection pooling, critical for schema-per-tenant search_path swap discipline
- Automated backups — 35-day PITR window in production
- Cross-region encrypted backups for catastrophic recovery
- KMS encryption at rest with Customer-Managed Keys (CMKs); opens BYOK premium-tier path
- ElastiCache Redis Multi-AZ — sessions, rate-limit state, query cache, BullMQ queues
- TimescaleDB extension on Postgres for BCP time-series data (uptime, SLA, dependency-graph metrics) — no separate time-series tier needed
- pgvector extension available for Phase 2 AI Assistant RAG embeddings
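The search_path discipline for schema-per-tenant can be sketched as a small helper that validates the tenant identifier and emits a transaction-scoped statement. The `tenant_<slug>` naming scheme and the slug format are assumptions for illustration. `SET LOCAL` is used because it lasts only until the end of the current transaction, so a pooled connection does not carry one tenant's path over to the next borrower.

```typescript
// Hypothetical slug format: lowercase letter first, then letters, digits, underscores.
const TENANT_SLUG = /^[a-z][a-z0-9_]{0,47}$/;

// Map a tenant slug to its schema name, rejecting anything that is not a plain identifier.
function tenantSchema(tenantSlug: string): string {
  if (!TENANT_SLUG.test(tenantSlug)) {
    throw new Error(`invalid tenant slug: ${tenantSlug}`);
  }
  return `tenant_${tenantSlug}`;
}

// Statement to run as the first command inside each tenant-scoped transaction.
function searchPathStatement(tenantSlug: string): string {
  return `SET LOCAL search_path TO "${tenantSchema(tenantSlug)}", public`;
}

console.log(searchPathStatement("acme"));
// SET LOCAL search_path TO "tenant_acme", public
```

Validating against an allowlist pattern matters here because schema names cannot be passed as bind parameters.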
Async / queue layer
- BullMQ on Redis — primary background-job queue. Richer features than SQS (delayed jobs, priorities, repeating jobs, per-queue rate limiting), and Redis is already in the stack.
- Dead-letter queue for failed jobs — audit trail of what didn't process
- Migration to SQS remains a viable path if BullMQ ops ever become problematic — both are well-understood patterns.
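The retry-then-dead-letter behaviour can be modelled as a pure decision function. This is a sketch of the policy, not BullMQ's API: in BullMQ the equivalent knobs are the `attempts` and `backoff` job options, and routing exhausted jobs to a dead-letter queue would be application code. The attempt limit and base delay below are illustrative.

```typescript
type JobOutcome =
  | { action: "retry"; delayMs: number }
  | { action: "dead-letter"; reason: string };

// Decide what happens after a failed attempt.
// attemptsMade counts the attempt that just failed (1 for the first failure).
function onJobFailure(
  attemptsMade: number,
  maxAttempts: number,
  baseDelayMs: number,
  errorMessage: string
): JobOutcome {
  if (attemptsMade >= maxAttempts) {
    // Exhausted: park the job where it can be inspected and audited.
    return { action: "dead-letter", reason: errorMessage };
  }
  // Exponential backoff: base, 2x base, 4x base, ...
  return { action: "retry", delayMs: baseDelayMs * 2 ** (attemptsMade - 1) };
}

console.log(onJobFailure(1, 5, 1000, "timeout")); // retry after 1000 ms
console.log(onJobFailure(3, 5, 1000, "timeout")); // retry after 4000 ms
console.log(onJobFailure(5, 5, 1000, "timeout")); // dead-letter
```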
Object storage (S3)
- Evidence bucket — customer-uploaded evidence artifacts. KMS-encrypted with CMK. Versioning + Object Lock for tamper-evidence (regulatory requirement).
- Documents bucket — generated reports, exports, customer downloads.
- Backup bucket — RDS exports + CloudTrail archive. Object Lock for 7-year retention.
- Lifecycle policies — old evidence to Glacier Deep Archive after retention threshold.
- All buckets accessed via VPC Endpoints — no public-internet egress for AWS-service traffic.
Secrets, keys & config
- AWS Secrets Manager — DB credentials, JWT signing keys, external API tokens. Auto-rotation, audited.
- AWS KMS — encryption keys with Customer-Managed Keys (CMKs). Per-environment keys never cross account boundaries. BYOK path for premium tier.
- AWS SSM Parameter Store — non-secret config.
Security & compliance services
- AWS GuardDuty — anomaly detection at AWS layer (compromised credentials, anomalous API calls, crypto-mining). Acts as "EDR for AWS infra"; findings feed Logz.io as primary SIEM.
- AWS Config — resource configuration tracking + drift detection. SOC 2 evidence.
- CloudTrail — multi-region trail to Object-Locked S3 bucket. Immutable record of every AWS API call.
- VPC Flow Logs — full network audit trail → CloudWatch → Logz.io.
- AWS Backup — orchestrates cross-service backup (RDS, S3, EBS).
- AWS Security Hub — NOT used; Logz.io is the primary alert pane.
- Automated cross-tenant isolation tests run in CI on every deploy — prove schema-per-tenant works.
Identity & access (internal staff)
- AWS IAM for human access (not Identity Center / SSO) — IAM users in role-based groups, MFA mandatory, least-privilege policies, no long-lived keys where avoidable (prefer assume-role + temporary creds), root locked down. See the Environments section for detail and the SSO revisit-later note.
- Break-glass procedure for emergency access — audited via CloudTrail
- No direct database access — bastion via AWS Session Manager only, fully audited
- IAM roles for ECS tasks — no long-lived credentials in containers (temporary task-role credentials, not stored keys)
- Least privilege per service; quarterly access reviews
Email & notifications
- AWS SES — transactional email (alerts, scheduled reports, password resets)
- AWS SNS — internal alerting fanout (email / SMS / Slack via webhook)
- WorkOS sends its own auth-related email (verification, MFA codes) via AuthKit
Infrastructure as Code
- Terraform (HCL) — all AWS infrastructure as code
- Modular: one module per logical layer (vpc, ecs-fargate, rds-postgres, elasticache, s3, logging, observability, iam, secrets)
- Per-environment tfvars scale resources without code changes
- Terraform state in shared-services S3 with DynamoDB locking
- GitHub Actions applies via OIDC + cross-account deploy role (no long-lived AWS keys in CI)
Container registry
- Amazon ECR in shared-services account — single registry; every environment pulls the same images
- Image scanning on push (Snyk or ECR-native)
- Lifecycle policy: retain last 50 images, expire older
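The "retain last 50 images" rule maps onto ECR's lifecycle policy document format. The sketch below builds that JSON from a count; the description text and rule priority are arbitrary, and in this stack the resulting document would presumably be supplied through Terraform rather than generated at runtime.

```typescript
// Subset of the ECR lifecycle policy format needed for a count-based rule.
interface LifecycleRule {
  rulePriority: number;
  description: string;
  selection: {
    tagStatus: "any" | "tagged" | "untagged";
    countType: "imageCountMoreThan";
    countNumber: number;
  };
  action: { type: "expire" };
}

// Build a policy that expires every image beyond the newest `count`.
function retainLastImages(count: number): { rules: LifecycleRule[] } {
  if (count < 1 || Math.floor(count) !== count) {
    throw new Error("count must be a positive integer");
  }
  return {
    rules: [
      {
        rulePriority: 1,
        description: `Keep only the last ${count} images`,
        selection: { tagStatus: "any", countType: "imageCountMoreThan", countNumber: count },
        action: { type: "expire" },
      },
    ],
  };
}

console.log(JSON.stringify(retainLastImages(50), null, 2));
```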
Disaster recovery
| Failure mode | Target | Mechanism |
|---|---|---|
| Single task / Fargate failure | RTO seconds | ALB removes from rotation; auto-replacement |
| Single AZ failure | RTO < 1 hour | Multi-AZ failover on RDS, ALB, Redis; Fargate cross-AZ spread |
| Data corruption | RPO < 15 min | RDS automated backups with continuous WAL archiving (transaction logs shipped about every 5 minutes); PITR |
| Region failure | RTO < 24 hours | Cross-region encrypted backups + Terraform redeploy runbook |
Annual DR drill — tested + documented (SOC 2 mandate). Weekly backup verification — automated restore tests.
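One check an automated backup verification might include is whether the newest restorable point still satisfies the RPO target. A minimal sketch, assuming the timestamp comes from the `LatestRestorableTime` field that RDS reports for a DB instance; the 15-minute threshold is the RPO from the table above, and the dates are made up.

```typescript
// True when the newest point we could restore to is within the RPO window.
function meetsRpo(latestRestorableTime: Date, now: Date, rpoMinutes: number): boolean {
  const lagMs = now.getTime() - latestRestorableTime.getTime();
  return lagMs >= 0 && lagMs <= rpoMinutes * 60 * 1000;
}

const checkTime = new Date("2024-01-01T12:00:00Z");
console.log(meetsRpo(new Date("2024-01-01T11:56:00Z"), checkTime, 15)); // true
console.log(meetsRpo(new Date("2024-01-01T11:40:00Z"), checkTime, 15)); // false
```

A failing check would be a natural trigger for the SNS alerting fanout.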
Phase 2 — Israel region (AWS Tel Aviv, il-central-1)
Same architecture, deployed as a separate AWS account (set-prod-il) when the first Nimbus-aligned Israeli customer (government / defense / regulated healthcare / banking) signs:
- Same Terraform modules, IL-specific tfvars
- Self-hosted Keycloak instance in IL for Israeli data residency
- Per-tenant routing in NestJS auth middleware (WorkOS for non-Nimbus, Keycloak for Nimbus)
- Logz.io destination TBD at that time (Israel-region option, or self-hosted ELK)
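The per-tenant routing decision in the auth middleware could reduce to something like the following. The `nimbusAligned` flag and the `TenantAuthConfig` shape are hypothetical; the source only states that Nimbus tenants use Keycloak and all others use WorkOS.

```typescript
type IdentityProvider = "workos" | "keycloak";

interface TenantAuthConfig {
  tenantId: string;
  nimbusAligned: boolean; // hypothetical flag set at tenant onboarding
}

// Pick the identity provider that must validate this tenant's sessions.
function resolveIdentityProvider(tenant: TenantAuthConfig): IdentityProvider {
  return tenant.nimbusAligned ? "keycloak" : "workos";
}

console.log(resolveIdentityProvider({ tenantId: "acme", nimbusAligned: false })); // workos
console.log(resolveIdentityProvider({ tenantId: "gov-il", nimbusAligned: true })); // keycloak
```

Keeping the decision in one function means downstream guards and resolvers never need to know which provider authenticated the request.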
API Strategy (Dual Protocol)
| Layer | Protocol | Why |
|---|---|---|
| Frontend ↔ BFF | GraphQL (@nestjs/graphql) | Dashboard aggregation (1 query per view), cross-framework graph data, end-to-end type safety via codegen |
| Public/Partner API | REST (@nestjs/swagger / OpenAPI) | Broad compatibility, simple rate limiting, clean audit trail per endpoint |
| Both served from same NestJS server | Shared auth + tenant middleware | Single deployment, consistent security layer |
GraphQL (Internal — Frontend Only):
- Powered by Apollo Server via @nestjs/graphql
- Schema-first approach with GraphQL Code Generator → auto-generated TypeScript types + React hooks
- Relay-style cursor pagination (max 100 per page) to prevent DoS
- DataLoaders for batching/deduplication (prevent N+1 queries)
- Query depth/complexity limiting to prevent abuse
- Compliance dashboards, cross-framework mapping, and evidence graphs all benefit from single-query aggregation
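The depth limit and page-size cap can be illustrated with two small guards. This sketch works on a simplified selection tree rather than a real GraphQL AST. The `SelectionNode` shape, the default page size of 25, the field names in the sample query, and the choice to clamp rather than reject oversized pages are all assumptions.

```typescript
// Simplified stand-in for a parsed GraphQL selection set.
interface SelectionNode {
  name: string;
  children?: SelectionNode[];
}

// Depth of the deepest field path, counting the root field as 1.
function selectionDepth(node: SelectionNode): number {
  let deepestChild = 0;
  for (const child of node.children ?? []) {
    deepestChild = Math.max(deepestChild, selectionDepth(child));
  }
  return 1 + deepestChild;
}

function assertDepthAllowed(root: SelectionNode, maxDepth: number): void {
  const depth = selectionDepth(root);
  if (depth > maxDepth) {
    throw new Error(`query depth ${depth} exceeds limit ${maxDepth}`);
  }
}

const MAX_PAGE_SIZE = 100;

// Normalise the `first` argument of a cursor-paginated field.
function clampPageSize(first: number | undefined, defaultSize: number = 25): number {
  if (first === undefined) return defaultSize;
  if (first < 1 || Math.floor(first) !== first) {
    throw new Error("first must be a positive integer");
  }
  return Math.min(first, MAX_PAGE_SIZE);
}

const dashboardQuery: SelectionNode = {
  name: "frameworks",
  children: [
    { name: "controls", children: [{ name: "evidence", children: [{ name: "fileName" }] }] },
  ],
};
console.log(selectionDepth(dashboardQuery)); // 4
console.log(clampPageSize(500));             // 100
```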
REST (External — Public API):
- OpenAPI 3.0 spec with Swagger UI documentation
- API key authentication (scoped per tenant, per-endpoint read/write permissions)
- Rate limiting (per-key, per-IP)
- Versioned endpoints (v1, v2)
- Webhook support for real-time event notifications
- Every mutation logged as an audit event
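The tenant-scoped, per-endpoint read/write permissions on API keys could be checked along these lines. The `ApiKey` shape, the endpoint paths, and the mapping of HTTP methods to read or write access are assumptions for illustration.

```typescript
type Access = "read" | "write";

interface ApiKey {
  tenantId: string;
  // endpoint path -> granted access levels (hypothetical shape)
  scopes: { [endpoint: string]: Access[] | undefined };
}

// Treat safe HTTP methods as reads and everything else as writes.
function requiredAccess(method: string): Access {
  const safeMethods = ["GET", "HEAD", "OPTIONS"];
  return safeMethods.indexOf(method.toUpperCase()) !== -1 ? "read" : "write";
}

// A key is valid only for its own tenant and only for endpoints it was granted.
function isAllowed(key: ApiKey, tenantId: string, endpoint: string, method: string): boolean {
  if (key.tenantId !== tenantId) return false;
  const granted = key.scopes[endpoint] ?? [];
  return granted.indexOf(requiredAccess(method)) !== -1;
}

const reportingKey: ApiKey = {
  tenantId: "acme",
  scopes: { "/v1/controls": ["read"], "/v1/evidence": ["read", "write"] },
};
console.log(isAllowed(reportingKey, "acme", "/v1/controls", "GET"));   // true
console.log(isAllowed(reportingKey, "acme", "/v1/controls", "POST"));  // false
console.log(isAllowed(reportingKey, "globex", "/v1/evidence", "GET")); // false
```

Denied and allowed write decisions are both natural inputs to the audit event logged for each mutation.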