The 23 disciplines a service must obey to be portable, observable, recoverable, secure, cost-aware, and continuously deployable in a modern cloud-native, AI-aware system.
A standing contract between every service and the platform that hosts it. Each factor is a discipline. A service that obeys these factors is portable, observable, recoverable, secure, cost-aware, and continuously deployable.
A roadmap, a methodology, a feature checklist, or a substitute for product or domain design. It does not prescribe the size of a service, its language, its bounded context, or its team. It describes the properties any service must hold to be production-ready.
These factors descend from the original Twelve-Factor App (Wiggins, 2012) and Kevin Hoffman's Beyond the Twelve-Factor App (2016). They expand that lineage to address the architecture, security, observability, AI-native, and FinOps realities of 2026. Predecessor mappings are in Appendix B.
The principles apply uniformly to a single microservice, a domain service, or a platform-of-platforms. The discipline is universal; the investment is tier-dependent. A small service can declare modest targets (e.g., "99.5% available, RTO 4h") and be fully compliant. A critical platform service might require 99.99% and RTO 5m. Follow each factor's principle; size the implementation to the service's tier. Where the recommendations describe more rigor than a small or experimental service warrants, treat them as the upper-bound reference — adopt the principle, then choose a proportionate implementation.
Each factor follows the same shape: Principle (the rule), In 2026 (current tools and patterns), and an Avoid callout. Cross-references use (see N). Examples of named services are given in four flavours wherever possible: AWS, Azure, GCP, and a self-hosted / cloud-agnostic equivalent. Glossary in Appendix A; heritage in Appendix B.
| # | Factor | One-line rule |
|---|---|---|
| 1 | Polyglot Mono-Repo, Symmetric Services | One repository, many runtimes, one service shape. |
| 2 | Contract-First, Multi-Audience | OpenAPI for services, MCP for agents, AsyncAPI for events — all versioned, all in repo before code. |
| 3 | Versioned, Backwards-Compatible Evolution | Contracts and data evolve; old consumers keep working. |
| 4 | Externalized Configuration, Secrets, Infrastructure, and Policy | Nothing inline; everything declarative and version-controlled. |
| 5 | Provenance-Tracked Dependencies | Lockfiles, SBOMs, signed images, vulnerability and license scans — every byte explainable. |
| 6 | Dev = CI = Prod | Devcontainers and identical backing services across every environment. |
| 7 | Build Once, Sign Once, Deploy Many | One immutable artifact promoted across environments; rollback is a digest swap. |
| 8 | Progressive, Feature-Flagged Delivery | Code reaches production well before users; previews exist for every PR; rollout is independent of deployment. |
| 9 | Stateless, Disposable, Idempotent, Horizontal | Processes hold no state, start and stop fast, are safe to retry, scale by replication. |
| 10 | Self-Bound Ports for Every Audience | Each service binds its own ports for HTTP, MCP, and A2A. |
| 11 | Backing Services as Bound Resources | Every external dependency is configuration-bound and swappable. |
| 12 | Async Messaging, Scheduled Work, and Durable Workflows | Broker by default; streams for replay/fan-out; jobs for cron; durable execution for long flows; signed webhooks for outbound. |
| 13 | Edge, Ingress, Gateway, and CDN Discipline | Every external request enters through a hardened, observable, policy-enforced edge. |
| 14 | Tenancy and Blast-Radius Isolation | Tenant boundaries are explicit at every layer; failures are contained. |
| 15 | Layered Testing, Including Non-Deterministic | Unit, integration, contract, end-to-end, performance, security, evals — each layer has a defined gate. |
| 16 | Observability via OpenTelemetry | One pipeline for logs, metrics, traces, and GenAI signals. |
| 17 | Resilience by Default | Every outbound interaction declares timeout, retry, circuit-breaker, bulkhead, and cost policy. |
| 18 | Disaster Recovery and Business Continuity | RTO and RPO defined per tier, replication explicit per data class, restores rehearsed. |
| 19 | SLOs, Error Budgets, and Runbooks-as-Code | Define what "working" means; measure it; respond to it. |
| 20 | Zero-Trust Identity and Authorization | No trusted network; every request authenticated and authorized at every layer. |
| 21 | Privacy, Data Classification, and Audit | Classify data, minimize collection, bound retention, redact at telemetry, audit immutably. |
| 22 | FinOps as a First-Class Property | Compute, storage, network, and AI costs are attributed per service, per tenant, per request. |
| 23 | Documentation, Decisions, and Machine-Readable Seams | Repository organized for both humans and software agents. |
One repository holds many services across many runtimes; every service follows the same shape regardless of language.
Polyglot mono-repos are the dominant pattern at scale (Nx, Turborepo, Bazel, Pants, or folder conventions). Language is a runtime detail; service shape is a contract — health endpoints, log format, observability instrumentation, container layout, and security middleware are identical across runtimes. Conventional commits and shared linting apply repository-wide. Per-runtime gates (formatters, type checkers, linters) run alongside repository-wide gates (commit-message lint, markdown lint, YAML lint, GitHub Actions / pipeline lint). One task entry-point per operation (bootstrap, up, test, lint) via Taskfile, just, or make so newcomers don't have to learn each runtime's idioms.
Wire-format drift between languages (PascalCase vs. camelCase, ISO-8601 vs. epoch ms — normalize in the contract, not by convention); copy-pasted "service templates" that diverge over time (regenerate, don't fork).
Every service publishes machine-readable contracts for its three audiences — humans, other services, and software agents — before implementation.
Three contract surfaces, all in source control:
Mocks, server stubs, and client SDKs are generated from specs. CI fails if code drifts from spec. A repository-wide aggregator (Backstage, Port, or a static catalog site) publishes the contracts for human and agent discovery.
Auto-generating MCP tools 1:1 from REST endpoints (agents need capabilities, not CRUD verbs); free-text "options" blobs in tool inputs; treating one audience as primary and the others as afterthoughts.
Contracts and data evolve; consumers don't break. Old versions keep working until consumers have demonstrably migrated.
/v1, /v2 URL prefixes; Deprecation and Sunset headers; deprecate before remove.getOrder.v1, getOrder.v2); manifests advertise both during deprecation.schemaVersion; new versions emit only after consumers can read them.BACKWARD default, FORWARD, or FULL); subject naming strategy committed; partition key is a contract (changing it = new topic); retention and compaction policy are contract terms; tombstones (null value on compacted topics = delete-by-key) are documented behaviour.node-pg-migrate, sqlx — idempotent, reversible, run in the deploy pipeline (see 7).A nightly compatibility-check job replays sampled production traffic against the candidate build to catch regressions before they reach customers.
"We'll bump everyone at once" (collapses when external agents or third-party consumers are in the mix); mixing schema and behaviour changes in one migration (separate them so each can be rolled back).
Everything that varies between environments — configuration, secrets, infrastructure topology, policy rules — lives outside the image, is declarative, and is version-controlled.
infra/, deployed via the CI/CD pipeline. One module per shared resource and one per service.policy/, enforced advisory in CI and blocking at runtime by a policy engine. Equivalents per cloud: AWS Service Control Policies / Config Rules, Azure Policy, GCP Organization Policy / Policy Controller; OPA Gatekeeper or Kyverno for Kubernetes-native enforcement.Litmus test: if this entire repository were pushed to a public mirror tomorrow, what would leak? Anything beyond "nothing" indicates incomplete externalization. Drift detection (terraform plan, bicep what-if, pulumi preview) runs in CI on every infra change; out-of-band manual changes in production trigger an alert.
Secrets stored in the config service (configuration is for non-sensitive values; only the vault holds credentials); "just this one config file in the image" — a single exception destroys the immutability story (see 7).
Every byte that ships to production is explainable, scanned, signed, and locked.
packages.lock.json, uv.lock / poetry.lock, pnpm-lock.yaml / package-lock.json, Cargo.lock, go.sum, etc.The artifact promoted from staging to production is the same digest — no rebuilds across environments. Production admission policy refuses unsigned images.
Transitive dependencies that bypass scans (private packages, unpinned base layers, install-time downloads); excepting "internal-only" dependencies from signing — internal is where the next supply-chain attack will originate.
The dev loop on a laptop, the build in CI, and the runtime in production share dependencies, tooling, and behaviour.
.devcontainer/devcontainer.json per workspace; one click in the IDE, identical environment on Windows + macOS + Linux.mise, asdf, nvm + pyenv, rtx) pins runtime versions; CI fails if a developer's local toolchain is below the floor.pre-commit, lefthook, husky) run formatters and quick lints; CI re-runs them as gates.In-memory test doubles for backing services (H2 instead of Postgres, sqlite instead of MySQL — they lie about behaviour at exactly the wrong moments); environmental drift hidden in locale, timezone, or case-sensitivity.
A single immutable artifact (image digest) is promoted across every environment. Configuration — not code — differs.
<service>:<git-sha> plus <service>:<semver> for releases. No :latest anywhere.release.json accompanies each build with image digest, SBOM hash, source commit, and matching infrastructure module versions.Mutable tags (:latest, :main) — they turn rollbacks into archaeology and admission policies into theatre; "for development only" config baked into images — there is no "for development only."
Releases and rollouts are independent. Code reaches production well before users; every PR is reviewable on a real environment.
feature-flags/.Flag debt — alert at 30/60/90 days; rolling out a flag and a code change in the same deploy (the whole point of flags is to decouple them); long-lived preview environments — they drift from main and become their own incident surface.
Processes hold no long-lived state, start and stop quickly, are safe to retry, and scale by replication.
Four facets of one architectural commitment — a service is a fungible replica.
Idempotency-Key; every message handler dedupes by message ID. "At least once" delivery is assumed.Real-time exception. A WebSocket / SignalR / SSE service holds connection state by definition. The discipline still applies — connection state is held in a backing service (see 11: Real-time hub) and the service process itself remains fungible. Any one replica can serve any one connection because the hub manages routing.
Shared idempotency middleware stores (idempotency-key, response-hash, expires-at) in a fast key-value store (e.g., Redis). Message handlers persist (message_id, processed_at, result_hash) before committing side effects (outbox pattern, see 12). Readiness flips to "not ready" before SIGTERM completes so the platform drains traffic.
"Just one tiny in-memory counter" (where horizontal scaling dies); in-process caches that take minutes to warm; distributed locks as a casual coordination primitive; rolling your own WebSocket fan-out instead of using a managed real-time hub.
Each service binds its own listening ports for every audience it serves. The platform routes; the service serves.
One service, multiple listeners:
The platform handles ingress, mTLS, and routing. The service is self-contained — no IIS, Tomcat, or external app server. Each runtime's entry point binds three configurable ports — PORT_HTTP, PORT_MCP, PORT_A2A — with stable defaults across local and cloud. A2A endpoints sit behind the same authentication layer (see 20) as MCP and HTTP.
The MCP surface drifting from the HTTP surface (same business capability, different framing — not different capabilities); hosting multiple services in a single container.
Every external dependency is attached at runtime via configuration and is swappable without redeploy.
Backing services span a far wider class than the original 12-factor "DB + cache + broker." A service that omits any relevant class below is implicitly putting that responsibility inside the application — almost always the wrong choice. Each row gives an AWS / Azure / GCP / self-hosted option.
| Class | Purpose | AWS | Azure | GCP | Self-hosted / cloud-agnostic |
|---|---|---|---|---|---|
| Relational DB | Transactional records | RDS, Aurora | Azure Database for PostgreSQL/MySQL/SQL | Cloud SQL, AlloyDB | PostgreSQL, MySQL, MariaDB, CockroachDB |
| Document / KV | Schema-flexible records | DynamoDB, DocumentDB | Cosmos DB | Firestore, Bigtable | MongoDB, Cassandra, ScyllaDB, Couchbase |
| Cache | Hot-path, session, rate-limit state | ElastiCache, MemoryDB | Azure Cache for Redis | Memorystore | Redis, KeyDB, Dragonfly, Memcached, Hazelcast |
| Search index | Full-text, faceted, hybrid | OpenSearch Service, Kendra | Azure AI Search | Vertex AI Search | Elasticsearch, OpenSearch, Meilisearch, Typesense, Algolia |
| Vector store | Embeddings, semantic retrieval | OpenSearch k-NN, Bedrock KB, Aurora pgvector | AI Search vectors, Cosmos DB vector | Vertex AI Vector Search, AlloyDB pgvector | pgvector, Qdrant, Weaviate, Milvus, Pinecone, Chroma |
| Object storage | Blobs, files, media, archive | S3 (Standard / IA / Glacier) | Blob Storage (Hot / Cool / Archive) | Cloud Storage (Standard / Nearline / Coldline / Archive) | MinIO, Ceph, SeaweedFS, Garage |
| Message broker (see 12) | Commands, work queues | SQS, Amazon MQ | Service Bus | Pub/Sub (with ordering keys) | RabbitMQ, NATS JetStream, ActiveMQ Artemis |
| Event stream (see 12) | Replay, CDC, fan-out, audit | Kinesis Data Streams, MSK | Event Hubs (Kafka-compatible) | Pub/Sub Lite, Dataflow | Apache Kafka, Confluent, Redpanda, Apache Pulsar |
| Workflow engine | Long-running orchestration | Step Functions, SWF | Durable Functions, Logic Apps | Cloud Workflows, Cloud Composer | Temporal, Cadence, Dapr Workflow, Argo Workflows, Conductor |
| Schema registry | Runtime contract enforcement | Glue Schema Registry | Azure Schema Registry | (application-layer; Confluent on GCP) | Confluent Schema Registry, Apicurio, Karapace |
| Real-time hub | Persistent connections, presence, fan-out | AppSync subscriptions, IoT Core, API Gateway WS | Azure SignalR Service, Web PubSub | Firebase Realtime Database, Firestore listeners | Centrifugo, Soketi, Pusher, Ably |
| Push notifications | Device delivery when app is closed | SNS Mobile Push, Pinpoint | Notification Hubs (APNS/FCM/WNS) | Firebase Cloud Messaging | OneSignal, Gotify, ntfy |
| Outbound communication | Email, SMS, voice, WhatsApp | SES, SNS, Pinpoint, Connect | Communication Services | (third-party; or partner add-ons) | SendGrid, Mailgun, Postmark, Twilio, MessageBird, Plivo |
| Inbound communication | Email-to-event, SMS receive, IVR | SES Receiving, Pinpoint two-way | Communication Services Email Receiving | (via partner) | SendGrid Inbound Parse, Twilio inbound, Mailgun Routes, signed-webhook receivers |
| CDN / edge cache (see 13) | Static assets, cacheable GETs | CloudFront | Front Door CDN, Azure CDN | Cloud CDN, Media CDN | Cloudflare, Fastly, Akamai, BunnyCDN, KeyCDN |
| Identity provider | Authentication, federation, SSO, B2C | Cognito, IAM Identity Center | Microsoft Entra ID, Entra External ID | Cloud Identity, Identity Platform, Firebase Auth | Keycloak, Authentik, Zitadel, Auth0, Okta, FusionAuth |
| Secrets manager | Credentials, keys, certificates | Secrets Manager, Parameter Store | Key Vault | Secret Manager | HashiCorp Vault, Bitwarden Secrets, Infisical, OpenBao |
| Configuration service | Non-secret values, feature flags | AppConfig | App Configuration | Runtime Config, Firebase Remote Config | Consul, etcd, Spring Cloud Config, Unleash, Flagsmith |
| Observability backend (see 16) | Trace, metric, log destination | CloudWatch, X-Ray, Managed Prometheus / Grafana | Application Insights, Azure Monitor | Cloud Operations (Logging / Monitoring / Trace) | Grafana stack (Loki/Mimir/Tempo), Prometheus, Datadog, New Relic, Honeycomb, Elastic, SigNoz |
| LLM provider | Generation, embeddings, evaluation | Bedrock, SageMaker | Azure OpenAI, AI Foundry | Vertex AI | Anthropic API, OpenAI API, Mistral API, Cohere; self-hosted Ollama, vLLM, TGI, llama.cpp |
| External SaaS / API | Domain-specific third-party | Payment, geocoding, OCR, KYC, mapping, B2B/EDI gateways, etc. — anything not in the rows above. | |||
The LLM row is included deliberately. An LLM is a backing service that demands additional rigor, not a separate architectural concept.
Each binding obeys these rules. The application declares its need; the platform provides the binding. Code receives an interface, not a connection string. Provider-specific SDKs do not leak. Every binding is wrapped by a circuit breaker (see 17), tagged for cost attribution (see 22), and has a documented degradation path when the dependency is unavailable.
gpt-5-2025-09-01, claude-opus-4-7-20260101) — no silent provider-side upgrades.prompts/<task>/v<n>.md, templated, versioned, eval-tested (see 15).Tight coupling to provider-specific SDKs; treating an LLM as "just an HTTP call"; rolling your own real-time fan-out, deliverability tracking, or push-notification routing when a managed service exists; using one backing service to fake another (a stream as a work queue, a database as a cache, a cache as durable storage).
Cross-service communication prefers events on a bus or stream over synchronous RPC. Scheduled work runs as cron-style jobs. Long-running flows are held by durable execution. Outbound integrations go through signed, retried, observable webhooks. Each substrate is picked for its specific shape — they are not interchangeable.
1. Message broker — the default for cross-service messaging. AWS SQS / Amazon MQ; Azure Service Bus; GCP Pub/Sub (with ordering keys); RabbitMQ, NATS JetStream, ActiveMQ Artemis self-hosted — for command-style messages and work queues. Rich filtering, dead-letter queues, transactional handoff. Lower throughput than streams, ack-on-consume, server-managed offsets.
Choose a broker when: a single consumer should act on each message; the message represents a command or work unit, not a historical fact; routing, filtering, or dead-letter handling matters; transactional handoff with the producing database matters; throughput is modest (thousands per second per topic, not hundreds of thousands). This covers the majority of microservice traffic. In doubt, start with the broker.
2. Event stream — for replay, fan-out, and audit-by-default. AWS Kinesis / MSK; Azure Event Hubs (Kafka-compatible); GCP Pub/Sub Lite or Dataflow; Apache Kafka / Confluent / Redpanda / Pulsar self-hosted — for high-throughput append-only logs. Replayable, multiple consumer groups, partition-ordered.
Choose a stream when: many independent consumers need the same events at their own pace; replay from a past offset is a real requirement (event sourcing, late-joining consumers, reprocessing after a bug fix, regulatory replay); throughput exceeds what brokers handle comfortably; CDC, analytics fan-out, ML pipelines, or audit streams are the use case; partition-based ordering of related events matters.
A single business event may legitimately flow through both — e.g., to a broker for immediate consumers and to a stream for analytics and replay.
Event-sourcing-as-source-of-truth (publish to stream before writing to DB; rebuild state from the stream) is a legitimate but costly pattern. It buys auditability, time-travel, and CQRS read-model freedom; it costs operational complexity, eventual-consistency reasoning at every read, and a hard dependency on stream availability for writes. Adopt it deliberately, per bounded context, with an ADR (see 23).
3. Scheduled work — cron jobs as a first-class workload. Periodic, time-triggered work runs as scheduled jobs on the platform's job runner: AWS EventBridge Scheduler, Lambda Scheduled Events, ECS Scheduled Tasks; Azure Container Apps Jobs, Logic Apps Recurrence, Functions Timer; GCP Cloud Scheduler + Cloud Run Jobs, Workflows; Kubernetes CronJobs, Argo Workflows CronWorkflow self-hosted. Discipline:
4. Durable workflows — for long-running orchestration. AWS Step Functions; Azure Durable Functions; GCP Cloud Workflows; Temporal, Cadence, Dapr Workflow, Argo Workflows self-hosted — for flows spanning seconds-to-days. State lives in the engine, not in process memory. Sagas with compensating actions handle distributed transactions.
The outbox pattern (write to DB and an outbox table in the same transaction; a separate process publishes from the outbox) is the reliable way to publish events from a transactional database — and works equally for broker, stream, and outbound webhook delivery.
Long-running agent runs are a special case of durable workflows. Loop state (steps, tool calls, decisions) is checkpointed; every run carries hard limits (max steps, max wall-clock, max tool calls, max tokens, max cost). Replay from the run record enables debugging and post-mortem.
5. Outbound webhooks — for delivering events to third-party consumers. When the consumer is outside the platform — a customer's URL, a partner system — the delivery channel is a webhook, not a broker subscription. Discipline:
Envelopes carry trace context (see 16), tenant ID (see 14), and schemaVersion (see 3) — identical envelope shape regardless of substrate. Synchronous RPC is reserved for low-latency, end-user-facing reads.
Using streams as a generic message bus when no consumer needs replay (you pay the operational cost of offsets and consumer groups for nothing); hidden synchronous chains masquerading as async (await is not an event); cron expressions buried inside application code rather than in IaC; agent loops without hard limits — a denial-of-wallet attack waiting to happen; adopting event-sourcing-as-source-of-truth platform-wide because it sounded good in a talk; firing-and-forgetting outbound webhooks inline from a request handler.
Every external request enters through a hardened, observable, policy-enforced edge. Static and cacheable content is served from a CDN at the same edge tier. Internal services never face the public internet directly.
/v1/* and /v2/* route to appropriate backend revisions.Per-route policies (rate limits, JWT validation, request size caps, WAF mode, cache TTL) live as YAML alongside the service's OpenAPI and apply to the gateway on PR merge. Cacheable responses carry explicit Cache-Control and Vary headers; non-cacheable responses say so explicitly. Errors follow RFC 9457 problem-details with a traceId for correlation.
A single bypass of the edge "for one internal use case" — that bypass becomes the next incident's root cause; rate limiting reimplemented in each service (push it to the edge; services enforce only business-logic concurrency); accidental caching of authenticated responses (always vary on the auth header or explicitly mark private).
Tenant boundaries are explicit at every layer. A failure or breach in one tenant does not compromise another.
Even single-tenant systems adopt explicit tenancy from day one — retrofitting it is invasive.
X-Tenant-Id), JWT claim, message envelope field, span attribute, log attribute, query predicate, row filter.Database schemas include a tenant_id column with row-security enforced at the database level — application code cannot disable it. An onboarding pipeline provisions per-tenant infrastructure (IaC parameters) and populates baseline configuration.
Tenant isolation as a code-level convention without database enforcement (one missed WHERE tenant_id = ? becomes the next data-leak incident); "we'll add tenancy later" — costs grow exponentially with existing data and code volume.
Every change passes through a defined testing pyramid before reaching production. Each layer has a clear gate; non-deterministic systems are tested through evals as a first-class layer.
promptfoo, DeepEval, Inspect, ragas, Azure AI Foundry evaluations, AWS Bedrock evaluations, Vertex AI evaluations.Eval datasets are versioned with semver in tests/evals/datasets/; changes go through PR review. A CI eval job posts a comment with score deltas and links. Shadow traffic — candidate prompt or model enabled for opt-in requests with outputs compared offline before promotion. Adversarial / red-team eval sets sit alongside happy-path sets and run on every release candidate.
Testing only the happy path (adversarial tests aren't optional); eval-as-vibes — a human eyeballing a few outputs. Codify the rubric or it doesn't exist.
Logs, metrics, and traces flow through one OpenTelemetry pipeline, including formal semantic conventions for AI workloads.
gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.id, plus prompt and completion when not redacted.mcp.tool.name, mcp.tool.version, arguments hash, decision outcome.trace_id, service, tenant_id, env.Each runtime auto-instruments HTTP, DB, broker, stream, MCP server, LLM client, real-time hub, communication channels, webhook dispatcher, and agent loop. Service code emits domain events; plumbing is automatic. A repository-wide attribute schema ensures dashboards and alerts work uniformly across services. PII redaction (see 21) runs in the SDK before export. LLM-specific UX layers — Langfuse, Phoenix, Helicone — are optional supplements, not replacements.
Logging-as-debugging — if you can't answer the question from existing traces, fix the instrumentation, don't add a log line; per-service custom attribute names — they make cross-service dashboards impossible.
Every external interaction declares timeout, retry, circuit-breaker, bulkhead, and cost policy.
Library choices follow language: Polly (.NET), tenacity + httpx-with-Hyx (Python), cockatiel or NestJS interceptors (Node), Resilience4j (JVM), failsafe-go / retry-go (Go), tower (Rust). All configured from a shared resilience policy schema. A resilience.yaml per service declares per-dependency policies; middleware loads them at startup. Resilience policies for backing services in infra/ (broker retry settings, real-time hub connection backoff, gateway retry budgets) are declared alongside the resource definition.
Default infinite retries (they amplify outages into incidents); the same retry policy for all callers (a user request and a background job have different patience budgets); retrying on a destructive non-idempotent endpoint.
Recovery time and recovery point objectives are defined per service tier, replication strategy is explicit per data class, and restores are rehearsed.
DR is a separate discipline from per-request resilience (see 17). Resilience handles a flapping dependency; DR handles a region going dark, an accidental delete-all, or ransomware.
Data replication strategy explicit per data class:
| Data class | Typical replication mode | Notes |
|---|---|---|
| Relational DB | Async geo-replica or active-passive geo-redundant | Promotion runbook required; lag is the RPO floor. AWS RDS / Aurora Global Database; Azure geo-replication / failover groups; GCP Cloud SQL cross-region replicas / AlloyDB |
| Document / NoSQL | Multi-region writes (active-active) where the data model allows | Conflict resolution policy explicit. DynamoDB Global Tables, Cosmos DB multi-region writes, Firestore multi-region, Cassandra multi-DC |
| Cache | Active-active geo-replication for low-latency reads, or rebuild on failover | Treat cache contents as recomputable; not the source of truth |
| Object storage | Cross-region replication for read access during regional outage | S3 CRR; Azure GRS / RA-GRS; GCS dual-region / multi-region; lifecycle policies replicate across regions |
| Message broker | Geo-DR pairing — primary-secondary alias, manual or scripted failover | In-flight messages are NOT replicated; consumers must be idempotent. SQS cross-region forwarding patterns; Service Bus geo-DR; Pub/Sub multi-region by default |
| Event stream | Sync geo-DR or mirror-maker — zero or near-zero data loss | Stream is the audit log; replication is non-negotiable. MSK Replicator; Event Hubs geo-DR; Pub/Sub Lite cross-region; Kafka MirrorMaker 2 / Confluent Cluster Linking |
| Real-time hub | Regional with client reconnection on failover | Connection state is ephemeral by design |
| Vector store / Search index | Rebuild from source-of-truth data store | Index in-region; reindex on regional recovery |
Each service declares its tier in slos/<service>.yaml with enforceable RTO/RPO targets. Critical-tier services run in two regions with active/passive failover. Promotion runbook in runbooks/dr/<service>.md, exercised quarterly. Backup restore drills run automatically against a non-prod environment and produce a pass/fail signal for the SLO dashboard.
Treating "the cloud" as inherently durable; untested backups — until a restore has succeeded, the backup is a hypothesis; replicating cache contents instead of regenerating them; assuming broker geo-DR replicates in-flight messages (it doesn't).
What "working" means is defined in code, measured continuously, and connected to operational decisions.
slos/<service>.yaml.runbooks/<symptom>.md, linked from every alert. Runbooks are code: linted, reviewed, exercised in chaos days (see 17) and DR drills (see 18).Repository has top-level slos/, runbooks/, and postmortems/. CI lints alerts to ensure each has a linked runbook. SLO definitions apply to the observability backend as code (Sloth, OpenSLO, Datadog SLO IaC, Azure Monitor SLO IaC, GCP Service Monitoring SLO IaC). The on-call schedule itself is defined as code (PagerDuty / Opsgenie / Grafana OnCall configuration in oncall.yaml).
SLOs nobody reads — if the dashboard isn't part of the regular operating cadence, the SLO is aspirational; heroic recovery without postmortems — the next incident has the same root cause and a different victim.
No network is trusted by default. Every request, from any source, carries an identity, and authorization is enforced at every layer.
Every HTTP, MCP, and A2A endpoint requires authentication — no anonymous routes outside /health/*. Shared auth/ middleware parses tokens and exposes a principal (user / service / agent + roles + tenant + clearance). Tool-level RBAC is declared in mcp.yaml and enforced before invocation. Destructive tools (any side effect that cannot be reversed by a subsequent call) require either elevated scope or a human-in-loop checkpoint. CSPM rules (AWS Config, Azure Defender for Cloud, GCP Security Command Center, or self-hosted Cloud Custodian / Steampipe) block deployment of any resource without a managed identity.
"Internal" endpoints that skip auth because "they're behind the firewall" — zero-trust means no inside; long-lived API keys — they become the next credential leak.
Every data field has a classification. Collection is minimized, location is known, retention is bounded, sensitive data is redacted before crossing into prompts and observability, and every state-changing action is auditable.
A repository-wide data-classification.yaml lists every field-name pattern and its classification. CI fails if a new field lacks one. A shared privacy/ middleware redacts classified fields from logs and traces. Vector stores tag every embedding with source classification; retrieval filters honor the caller's clearance. The audit log is a separate event stream piped to immutable object storage; no service writes to audit storage directly.
Free-text fields that quietly become PII landfills (classify by the highest-classification content they might hold); audit logs in the same store as application data — co-located audit logs are tampered audit logs in the wrong incident.
Compute, storage, network, and AI token costs are attributable per service, per tenant, per request — and visible to engineers in their normal workflow.
service, env, team, cost-center, tier). Enforced via AWS Tag Policies, Azure Policy, GCP Organization Policy, or Crossplane policy.cost.tokens.prompt, cost.tokens.completion, cost.compute.ms, cost.estimated_usd).A nightly job rolls cost up by service, endpoint, and tenant; results appear in a service catalog tab and observability workbooks. The shared LLM client blocks calls that would exceed the per-request budget unless explicitly elevated. Quarterly cost reviews are part of the engineering cadence — owned by service teams, not finance.
Aggregate-only dashboards (the unit-economics question requires per-request granularity); treating cost as somebody else's problem (the team that ships the code owns its operating cost).
The repository is structured for two readers — humans and software agents — and they're now the same audience. Decisions are captured where future readers will look.
AI agents (Claude Code, Copilot, Cursor, Aider, Cline) are permanent collaborators. Repository structure, naming, and documentation are architectural choices that determine how effectively those collaborators — and humans — can work.
CLAUDE.md, .cursorrules, .github/copilot-instructions.md, AGENTS.md, .aider.conf.yml — generated from a single source explaining architecture, conventions, gotchas, and how to run things. Per-service equivalents cover service-specific concerns.docs/adr/NNNN-<slug>.md (MADR format) capture every architectural decision with date, status, context, options considered, decision, and consequences. Reversing an ADR requires a new ADR that supersedes it.OrderRepository over OrderHelper; cancelOrder() over processOrder().docs/onboarding.md — that an external collaborator (human or agent) can follow to a working dev loop without external help.A weekly job summarizes recent ADRs and posts to the team channel — decisions don't get lost in a folder.
Documentation that becomes a parallel universe — beautiful, ignored, wrong (tie docs to executable artifacts so drift is detectable in CI); ADRs as compliance theater — written after the fact to justify a decision already made.
| A2A | Agent-to-Agent — protocol for inter-agent communication including capability discovery and trust handshake. |
| ADR | Architecture Decision Record (MADR format). |
| APM | Application Performance Monitoring. |
| AsyncAPI | OpenAPI-equivalent specification for event-driven and message-based APIs. |
| CDC | Change Data Capture — streaming database changes as events. |
| CDN | Content Delivery Network. |
| CSPM | Cloud Security Posture Management. |
| Dapr | Distributed Application Runtime — sidecar primitives for service mesh, state, pubsub, secrets. |
| DPDP | India's Digital Personal Data Protection Act, 2023. |
| DR | Disaster Recovery. |
| FinOps | Discipline of managing variable cloud and AI spend as an engineering concern. |
| GenAI Conventions | OpenTelemetry semantic conventions for generative-AI telemetry. |
| IaC | Infrastructure as Code. |
| KEDA | Kubernetes Event-Driven Autoscaling. |
| LLM-as-judge | Pattern where an LLM scores another LLM's output against a rubric. |
| MADR | Markdown Any Decision Records — common ADR format. |
| MCP | Model Context Protocol — standard for exposing tools, prompts, and resources to LLM agents. |
| mTLS | Mutual TLS — both sides authenticate by certificate. |
| OIDC | OpenID Connect. |
| OPA | Open Policy Agent (with the Rego policy language). |
| OTel / OTLP | OpenTelemetry / OpenTelemetry Protocol. |
| PII | Personally Identifiable Information. |
| RAG | Retrieval-Augmented Generation. |
| RBAC | Role-Based Access Control. |
| RTO / RPO | Recovery Time Objective / Recovery Point Objective. |
| SBOM | Software Bill of Materials. |
| Sigstore / Notation | Container image and artifact signing systems. |
| SLI / SLO | Service Level Indicator / Service Level Objective. |
| WAF | Web Application Firewall. |
This document derives from two predecessors.
| Original factor | 23 Factors |
|---|---|
| I. Codebase | 1 |
| II. Dependencies | 5 |
| III. Config | 4 |
| IV. Backing services | 11 |
| V. Build, release, run | 7 |
| VI. Processes | 9 |
| VII. Port binding | 10 |
| VIII. Concurrency | 9 |
| IX. Disposability | 9 |
| X. Dev/prod parity | 6 |
| XI. Logs | 16 |
| XII. Admin processes | 7, 12, 19 (distributed) |
| Hoffman factor | 23 Factors |
|---|---|
| 1. One Codebase | 1 |
| 2. API First | 2 |
| 3. Dependency Management | 5 |
| 4. Design, Build, Release, Run | 7 (with design distributed across 2, 5, 23) |
| 5. Configuration, Credentials, Code | 4 |
| 6. Logs | 16 |
| 7. Disposability | 9 |
| 8. Backing Services | 11 |
| 9. Environment Parity | 6 |
| 10. Administrative Processes | 7, 12, 19 |
| 11. Port Binding | 10 |
| 12. Stateless Processes | 9 |
| 13. Concurrency | 9 |
| 14. Telemetry | 16 |
| 15. Authentication and Authorization | 20 |
Beyond the inherited factors, this manifesto introduces explicit disciplines for: contract-first multi-audience design including agents (2), versioned backwards-compatible evolution across HTTP, MCP, events, streams, and database (3), provenance and supply-chain integrity (5), progressive feature-flagged delivery with preview environments (8), self-bound ports for HTTP / MCP / A2A audiences (10), expanded backing-services taxonomy covering real-time hubs, push, communications, CDN, LLM, and external SaaS (11), broker / stream / scheduled-job / durable-workflow / outbound-webhook discipline (12), edge / ingress / gateway / CDN as one tier (13), tenancy and blast-radius isolation (14), layered testing with evals as a first-class layer (15), OpenTelemetry for AI workloads (16), resilience by default with cost circuit breakers (17), DR with explicit replication strategy per data class (18), SLOs and runbooks-as-code (19), zero-trust identity for humans, services, and agents (20), privacy / classification / audit (21), FinOps as engineering discipline (22), and machine-readable seams for human and agent collaborators (23).