23 Factors — Cloud-Native Engineering Factors

About this manifesto

What this is

A standing contract between every service and the platform that hosts it. Each factor is a discipline. A service that obeys these factors is portable, observable, recoverable, secure, cost-aware, and continuously deployable.

What this is not

A roadmap, a methodology, a feature checklist, or a substitute for product or domain design. It does not prescribe the size of a service, its language, its bounded context, or its team. It describes the properties any service must hold to be production-ready.

Heritage

These factors descend from the original Twelve-Factor App (Wiggins, 2012) and Kevin Hoffman's Beyond the Twelve-Factor App (2016). They expand that lineage to address the architecture, security, observability, AI-native, and FinOps realities of 2026. Predecessor mappings are in Appendix B.

Applicability

The principles apply uniformly to a single microservice, a domain service, or a platform-of-platforms. The discipline is universal; the investment is tier-dependent. A small service can declare modest targets (e.g., "99.5% available, RTO 4h") and be fully compliant. A critical platform service might require 99.99% and RTO 5m. Follow each factor's principle; size the implementation to the service's tier. Where the recommendations describe more rigor than a small or experimental service warrants, treat them as the upper-bound reference — adopt the principle, then choose a proportionate implementation.

The 2026 landscape this answers to

Containers, managed orchestration, and GitOps are baseline.
OpenTelemetry has settled the observability standards war, including for AI workloads.
Supply-chain attacks made provenance, SBOMs, and image signing non-negotiable.
Infrastructure, configuration, and policy are declarative and version-controlled.
Services have three audiences: humans, other services, and software agents.
Reliability is engineered — SLOs, error budgets, runbooks, chaos drills, and DR rehearsals are part of the build.
Cloud and AI costs are an engineering concern, not a finance memo.

How to read this

Each factor follows the same shape: Principle (the rule), In 2026 (current tools and patterns), and an Avoid callout. Cross-references use (see N). Examples of named services are given in four flavours wherever possible: AWS, Azure, GCP, and a self-hosted / cloud-agnostic equivalent. Glossary in Appendix A; heritage in Appendix B.

The 23 Factors at a Glance

#	Factor	One-line rule
1	Polyglot Mono-Repo, Symmetric Services	One repository, many runtimes, one service shape.
2	Contract-First, Multi-Audience	OpenAPI for services, MCP for agents, AsyncAPI for events — all versioned, all in repo before code.
3	Versioned, Backwards-Compatible Evolution	Contracts and data evolve; old consumers keep working.
4	Externalized Configuration, Secrets, Infrastructure, and Policy	Nothing inline; everything declarative and version-controlled.
5	Provenance-Tracked Dependencies	Lockfiles, SBOMs, signed images, vulnerability and license scans — every byte explainable.
6	Dev = CI = Prod	Devcontainers and identical backing services across every environment.
7	Build Once, Sign Once, Deploy Many	One immutable artifact promoted across environments; rollback is a digest swap.
8	Progressive, Feature-Flagged Delivery	Code reaches production well before users; previews exist for every PR; rollout is independent of deployment.
9	Stateless, Disposable, Idempotent, Horizontal	Processes hold no state, start and stop fast, are safe to retry, scale by replication.
10	Self-Bound Ports for Every Audience	Each service binds its own ports for HTTP, MCP, and A2A.
11	Backing Services as Bound Resources	Every external dependency is configuration-bound and swappable.
12	Async Messaging, Scheduled Work, and Durable Workflows	Broker by default; streams for replay/fan-out; jobs for cron; durable execution for long flows; signed webhooks for outbound.
13	Edge, Ingress, Gateway, and CDN Discipline	Every external request enters through a hardened, observable, policy-enforced edge.
14	Tenancy and Blast-Radius Isolation	Tenant boundaries are explicit at every layer; failures are contained.
15	Layered Testing, Including Non-Deterministic	Unit, integration, contract, end-to-end, performance, security, evals — each layer has a defined gate.
16	Observability via OpenTelemetry	One pipeline for logs, metrics, traces, and GenAI signals.
17	Resilience by Default	Every outbound interaction declares timeout, retry, circuit-breaker, bulkhead, and cost policy.
18	Disaster Recovery and Business Continuity	RTO and RPO defined per tier, replication explicit per data class, restores rehearsed.
19	SLOs, Error Budgets, and Runbooks-as-Code	Define what "working" means; measure it; respond to it.
20	Zero-Trust Identity and Authorization	No trusted network; every request authenticated and authorized at every layer.
21	Privacy, Data Classification, and Audit	Classify data, minimize collection, bound retention, redact at telemetry, audit immutably.
22	FinOps as a First-Class Property	Compute, storage, network, and AI costs are attributed per service, per tenant, per request.
23	Documentation, Decisions, and Machine-Readable Seams	Repository organized for both humans and software agents.

Polyglot Mono-Repo, Symmetric Services

One repository holds many services across many runtimes; every service follows the same shape regardless of language.

In 2026

Polyglot mono-repos are the dominant pattern at scale (Nx, Turborepo, Bazel, Pants, or folder conventions). Language is a runtime detail; service shape is a contract — health endpoints, log format, observability instrumentation, container layout, and security middleware are identical across runtimes. Conventional commits and shared linting apply repository-wide. Per-runtime gates (formatters, type checkers, linters) run alongside repository-wide gates (commit-message lint, markdown lint, YAML lint, GitHub Actions / pipeline lint). One task entry-point per operation (bootstrap, up, test, lint) via Taskfile, just, or make so newcomers don't have to learn each runtime's idioms.

Avoid

Wire-format drift between languages (PascalCase vs. camelCase, ISO-8601 vs. epoch ms — normalize in the contract, not by convention); copy-pasted "service templates" that diverge over time (regenerate, don't fork).

Contract-First, Multi-Audience

Every service publishes machine-readable contracts for its three audiences — humans, other services, and software agents — before implementation.

In 2026

Three contract surfaces, all in source control:

OpenAPI 3.1+ for HTTP — services and UIs.
MCP tool manifests for agents — JSON Schema inputs/outputs, side-effect annotations (read-only, idempotent-write, destructive), authorization scopes per tool. Tool descriptions are written for an LLM reader.
AsyncAPI 3+ for events — channels, envelopes, headers, protocol bindings. Payload schema format matches the substrate: JSON Schema embedded for brokers and outbound webhooks; Avro / Protobuf / JSON Schema enforced at runtime by a Schema Registry for event streams (AWS Glue Schema Registry, Azure Schema Registry, Confluent / Apicurio for self-hosted; GCP relies on application-layer validation atop Pub/Sub today). The spec is the document; the registry is the gate.
CloudEvents envelope as the universal cross-system event interop standard.

Mocks, server stubs, and client SDKs are generated from specs. CI fails if code drifts from spec. A repository-wide aggregator (Backstage, Port, or a static catalog site) publishes the contracts for human and agent discovery.

Avoid

Auto-generating MCP tools 1:1 from REST endpoints (agents need capabilities, not CRUD verbs); free-text "options" blobs in tool inputs; treating one audience as primary and the others as afterthoughts.

Versioned, Backwards-Compatible Evolution

Contracts and data evolve; consumers don't break. Old versions keep working until consumers have demonstrably migrated.

In 2026

HTTP: /v1, /v2 URL prefixes; Deprecation and Sunset headers; deprecate before remove.
MCP: tool names embed major version (getOrder.v1, getOrder.v2); manifests advertise both during deprecation.
Broker events and outbound webhooks: versioned envelopes with schemaVersion; new versions emit only after consumers can read them.
Event streams: registry-enforced compatibility on top of envelope versioning. Compatibility mode per subject (BACKWARD default, FORWARD, or FULL); subject naming strategy committed; partition key is a contract (changing it = new topic); retention and compaction policy are contract terms; tombstones (null value on compacted topics = delete-by-key) are documented behaviour.
Database schemas: forward-compatible migrations using expand-then-contract — add and dual-write before reads switch over; remove old columns only after every reader has migrated. Tooling: EF Core migrations, Alembic, Flyway, Liquibase, node-pg-migrate, sqlx — idempotent, reversible, run in the deploy pipeline (see 7).

A nightly compatibility-check job replays sampled production traffic against the candidate build to catch regressions before they reach customers.

Avoid

"We'll bump everyone at once" (collapses when external agents or third-party consumers are in the mix); mixing schema and behaviour changes in one migration (separate them so each can be rolled back).

Externalized Configuration, Secrets, Infrastructure, and Policy

Everything that varies between environments — configuration, secrets, infrastructure topology, policy rules — lives outside the image, is declarative, and is version-controlled.

In 2026

Configuration in a managed config service with hot-reload subscriptions for non-secret values.
AWS AppConfig · Azure App Configuration · GCP Runtime Config / Firebase Remote Config · Self-hosted HashiCorp Consul, etcd, Spring Cloud Config.
Secrets in a managed vault accessed via workload identity — never in env vars, never in repo, never in image layers.
AWS Secrets Manager / Parameter Store · Azure Key Vault · GCP Secret Manager · Self-hosted HashiCorp Vault, Bitwarden Secrets Manager, Infisical.
Infrastructure as code in infra/, deployed via the CI/CD pipeline. One module per shared resource and one per service.
AWS CloudFormation, CDK · Azure Bicep, ARM · GCP Deployment Manager, Config Connector · Cloud-agnostic Terraform, Pulumi, Crossplane.
Policy as code: cloud-native policy assignments and OPA/Rego or Conftest rules in policy/, enforced advisory in CI and blocking at runtime by a policy engine. Equivalents per cloud: AWS Service Control Policies / Config Rules, Azure Policy, GCP Organization Policy / Policy Controller; OPA Gatekeeper or Kyverno for Kubernetes-native enforcement.
Pipelines as YAML in repo — no inline secrets, no portal-defined steps.

Litmus test: if this entire repository were pushed to a public mirror tomorrow, what would leak? Anything beyond "nothing" indicates incomplete externalization. Drift detection (terraform plan, bicep what-if, pulumi preview) runs in CI on every infra change; out-of-band manual changes in production trigger an alert.

Avoid

Secrets stored in the config service (configuration is for non-sensitive values; only the vault holds credentials); "just this one config file in the image" — a single exception destroys the immutability story (see 7).

Provenance-Tracked Dependencies

Every byte that ships to production is explainable, scanned, signed, and locked.

In 2026

Lockfiles committed for every package manager — packages.lock.json, uv.lock / poetry.lock, pnpm-lock.yaml / package-lock.json, Cargo.lock, go.sum, etc.
Base images pinned by digest, not tag.
SBOMs generated per artifact (CycloneDX or SPDX), stored in the registry alongside the image.
Image signing with Sigstore (cosign) or Notation; admission policies in target environments verify before pull. Every cloud's container registry now supports signature verification on pull (AWS ECR, Azure Container Registry, GCP Artifact Registry).
Vulnerability scanning on every push; high/critical fail the build.
AWS Inspector, ECR scanning · Azure Defender for Containers · GCP Artifact Analysis, Container Analysis · Self-hosted Trivy, Grype, Clair, Snyk.
License scanning (FOSSA, Syft, Snyk License) gates copyleft creep.
Dependency update bots (Dependabot, Renovate) open PRs with full CI runs including evals (see 15) for AI-touching code.

The artifact promoted from staging to production is the same digest — no rebuilds across environments. Production admission policy refuses unsigned images.

Avoid

Transitive dependencies that bypass scans (private packages, unpinned base layers, install-time downloads); excepting "internal-only" dependencies from signing — internal is where the next supply-chain attack will originate.

Dev = CI = Prod

The dev loop on a laptop, the build in CI, and the runtime in production share dependencies, tooling, and behaviour.

In 2026

.devcontainer/devcontainer.json per workspace; one click in the IDE, identical environment on Windows + macOS + Linux.
A local compose stack brings up real backing services (relational DB, cache, broker, stream, object storage, OpenTelemetry collector) at the same major versions as cloud — Docker Compose, Podman Compose, Tilt, Skaffold, or DevPod. Cloud emulators where they exist (LocalStack for AWS, Azurite for Azure Storage, GCP firebase-tools / pubsub-emulator) substitute when a managed service has no portable equivalent.
A version manager (mise, asdf, nvm + pyenv, rtx) pins runtime versions; CI fails if a developer's local toolchain is below the floor.
CI jobs run inside the same image as the devcontainer.
Pre-commit hooks (pre-commit, lefthook, husky) run formatters and quick lints; CI re-runs them as gates.
Cross-platform shell scripts are POSIX-clean and invoked through a single task runner so Windows, macOS, and Linux developers share one entry point.

Avoid

In-memory test doubles for backing services (H2 instead of Postgres, sqlite instead of MySQL — they lie about behaviour at exactly the wrong moments); environmental drift hidden in locale, timezone, or case-sensitivity.

Build Once, Sign Once, Deploy Many

A single immutable artifact (image digest) is promoted across every environment. Configuration — not code — differs.

In 2026

Build pipeline: image → tagged with commit SHA → signed → pushed to the registry.
Release pipelines pull the exact digest (never rebuild) and deploy with environment-specific config injected from the config service / vault at startup.
Rollback is "deploy the previous digest."
Database migrations (see 3) run as pre-deploy jobs against the target environment's database before the new image accepts traffic.
A multi-stage pipeline (GitHub Actions, GitLab CI, Azure DevOps Pipelines, AWS CodePipeline, GCP Cloud Build, Argo CD, Flux, Jenkins, etc.) automates non-prod promotion on green and gates prod on manual approval plus a green smoke-test job.
Image tags: <service>:<git-sha> plus <service>:<semver> for releases. No :latest anywhere.
A release.json accompanies each build with image digest, SBOM hash, source commit, and matching infrastructure module versions.

Avoid

Mutable tags (:latest, :main) — they turn rollbacks into archaeology and admission policies into theatre; "for development only" config baked into images — there is no "for development only."

Progressive, Feature-Flagged Delivery

Releases and rollouts are independent. Code reaches production well before users; every PR is reviewable on a real environment.

In 2026

Trunk-based development plus continuous deployment to non-prod and gated deployment to prod.
Preview environments per pull request — an ephemeral deployment of the PR's image plus its required backing services. Reviewable URL posted to the PR. Torn down on merge or after a TTL. Modern platforms make this routine: AWS App Runner / ECS preview services; Azure Container Apps revisions; GCP Cloud Run revisions; Kubernetes namespaces with Argo CD ApplicationSets; Vercel / Netlify-style preview deploys for frontends.
Feature flags wrap every non-trivial change.
AWS AppConfig Feature Flags, Evidently · Azure App Configuration Feature Management · GCP Firebase Remote Config · Cloud-agnostic LaunchDarkly, Unleash, Flagsmith, GrowthBook, OpenFeature provider-of-choice.
Progressive rollout — canary, blue/green, percentage — managed by the platform: AWS CodeDeploy / App Mesh; Azure Container Apps revisions and traffic-splitting; GCP Cloud Run traffic splitting; Argo Rollouts or Flagger on Kubernetes.
Dark launches for risky changes; observe production behaviour with the user-visible flag off.
Each flag has an owner, lifespan, and target removal date in feature-flags/.

Avoid

Flag debt — alert at 30/60/90 days; rolling out a flag and a code change in the same deploy (the whole point of flags is to decouple them); long-lived preview environments — they drift from main and become their own incident surface.

Stateless, Disposable, Idempotent, Horizontal

Processes hold no long-lived state, start and stop quickly, are safe to retry, and scale by replication.

In 2026

Four facets of one architectural commitment — a service is a fungible replica.

Stateless — all long-lived state lives in backing services (see 11). No in-memory session state, no local disk persistence beyond ephemeral scratch.
Disposable — cold start under one second for HTTP/MCP requests; full readiness under five. SIGTERM handlers drain in-flight work, ack outstanding messages, and exit within the platform's grace period. Failed instances are killed and rescheduled, never debugged in place.
Idempotent — every state-changing HTTP request honours Idempotency-Key; every message handler dedupes by message ID. "At least once" delivery is assumed.
Horizontal — autoscaling on queue depth, HTTP RPS, or custom metrics. AWS Application Auto Scaling, Azure Container Apps autoscale, GCP Cloud Run concurrency-based scaling, Kubernetes HPA + KEDA. Vertical scaling is a cost-optimization choice, never a primary lever.

Real-time exception. A WebSocket / SignalR / SSE service holds connection state by definition. The discipline still applies — connection state is held in a backing service (see 11: Real-time hub) and the service process itself remains fungible. Any one replica can serve any one connection because the hub manages routing.

Shared idempotency middleware stores (idempotency-key, response-hash, expires-at) in a fast key-value store (e.g., Redis). Message handlers persist (message_id, processed_at, result_hash) before committing side effects (outbox pattern, see 12). Readiness flips to "not ready" before SIGTERM completes so the platform drains traffic.

Avoid

"Just one tiny in-memory counter" (where horizontal scaling dies); in-process caches that take minutes to warm; distributed locks as a casual coordination primitive; rolling your own WebSocket fan-out instead of using a managed real-time hub.

Self-Bound Ports for Every Audience

Each service binds its own listening ports for every audience it serves. The platform routes; the service serves.

In 2026

One service, multiple listeners:

HTTP for REST/GraphQL/gRPC — humans and other services.
MCP Streamable HTTP transport — software agents (the modern replacement for stdio in cloud).
A2A endpoints per the Agent-to-Agent protocol — peer agents needing capability discovery and trust handshakes.
Metrics scrape port if the observability backend (see 16) requires it.

The platform handles ingress, mTLS, and routing. The service is self-contained — no IIS, Tomcat, or external app server. Each runtime's entry point binds three configurable ports — PORT_HTTP, PORT_MCP, PORT_A2A — with stable defaults across local and cloud. A2A endpoints sit behind the same authentication layer (see 20) as MCP and HTTP.

Avoid

The MCP surface drifting from the HTTP surface (same business capability, different framing — not different capabilities); hosting multiple services in a single container.

Backing Services as Bound Resources

Every external dependency is attached at runtime via configuration and is swappable without redeploy.

In 2026

Backing services span a far wider class than the original 12-factor "DB + cache + broker." A service that omits any relevant class below is implicitly putting that responsibility inside the application — almost always the wrong choice. Each row gives an AWS / Azure / GCP / self-hosted option.

Class	Purpose	AWS	Azure	GCP	Self-hosted / cloud-agnostic
Relational DB	Transactional records	RDS, Aurora	Azure Database for PostgreSQL/MySQL/SQL	Cloud SQL, AlloyDB	PostgreSQL, MySQL, MariaDB, CockroachDB
Document / KV	Schema-flexible records	DynamoDB, DocumentDB	Cosmos DB	Firestore, Bigtable	MongoDB, Cassandra, ScyllaDB, Couchbase
Cache	Hot-path, session, rate-limit state	ElastiCache, MemoryDB	Azure Cache for Redis	Memorystore	Redis, KeyDB, Dragonfly, Memcached, Hazelcast
Search index	Full-text, faceted, hybrid	OpenSearch Service, Kendra	Azure AI Search	Vertex AI Search	Elasticsearch, OpenSearch, Meilisearch, Typesense, Algolia
Vector store	Embeddings, semantic retrieval	OpenSearch k-NN, Bedrock KB, Aurora pgvector	AI Search vectors, Cosmos DB vector	Vertex AI Vector Search, AlloyDB pgvector	pgvector, Qdrant, Weaviate, Milvus, Pinecone, Chroma
Object storage	Blobs, files, media, archive	S3 (Standard / IA / Glacier)	Blob Storage (Hot / Cool / Archive)	Cloud Storage (Standard / Nearline / Coldline / Archive)	MinIO, Ceph, SeaweedFS, Garage
Message broker (see 12)	Commands, work queues	SQS, Amazon MQ	Service Bus	Pub/Sub (with ordering keys)	RabbitMQ, NATS JetStream, ActiveMQ Artemis
Event stream (see 12)	Replay, CDC, fan-out, audit	Kinesis Data Streams, MSK	Event Hubs (Kafka-compatible)	Pub/Sub Lite, Dataflow	Apache Kafka, Confluent, Redpanda, Apache Pulsar
Workflow engine	Long-running orchestration	Step Functions, SWF	Durable Functions, Logic Apps	Cloud Workflows, Cloud Composer	Temporal, Cadence, Dapr Workflow, Argo Workflows, Conductor
Schema registry	Runtime contract enforcement	Glue Schema Registry	Azure Schema Registry	(application-layer; Confluent on GCP)	Confluent Schema Registry, Apicurio, Karapace
Real-time hub	Persistent connections, presence, fan-out	AppSync subscriptions, IoT Core, API Gateway WS	Azure SignalR Service, Web PubSub	Firebase Realtime Database, Firestore listeners	Centrifugo, Soketi, Pusher, Ably
Push notifications	Device delivery when app is closed	SNS Mobile Push, Pinpoint	Notification Hubs (APNS/FCM/WNS)	Firebase Cloud Messaging	OneSignal, Gotify, ntfy
Outbound communication	Email, SMS, voice, WhatsApp	SES, SNS, Pinpoint, Connect	Communication Services	(third-party; or partner add-ons)	SendGrid, Mailgun, Postmark, Twilio, MessageBird, Plivo
Inbound communication	Email-to-event, SMS receive, IVR	SES Receiving, Pinpoint two-way	Communication Services Email Receiving	(via partner)	SendGrid Inbound Parse, Twilio inbound, Mailgun Routes, signed-webhook receivers
CDN / edge cache (see 13)	Static assets, cacheable GETs	CloudFront	Front Door CDN, Azure CDN	Cloud CDN, Media CDN	Cloudflare, Fastly, Akamai, BunnyCDN, KeyCDN
Identity provider	Authentication, federation, SSO, B2C	Cognito, IAM Identity Center	Microsoft Entra ID, Entra External ID	Cloud Identity, Identity Platform, Firebase Auth	Keycloak, Authentik, Zitadel, Auth0, Okta, FusionAuth
Secrets manager	Credentials, keys, certificates	Secrets Manager, Parameter Store	Key Vault	Secret Manager	HashiCorp Vault, Bitwarden Secrets, Infisical, OpenBao
Configuration service	Non-secret values, feature flags	AppConfig	App Configuration	Runtime Config, Firebase Remote Config	Consul, etcd, Spring Cloud Config, Unleash, Flagsmith
Observability backend (see 16)	Trace, metric, log destination	CloudWatch, X-Ray, Managed Prometheus / Grafana	Application Insights, Azure Monitor	Cloud Operations (Logging / Monitoring / Trace)	Grafana stack (Loki/Mimir/Tempo), Prometheus, Datadog, New Relic, Honeycomb, Elastic, SigNoz
LLM provider	Generation, embeddings, evaluation	Bedrock, SageMaker	Azure OpenAI, AI Foundry	Vertex AI	Anthropic API, OpenAI API, Mistral API, Cohere; self-hosted Ollama, vLLM, TGI, llama.cpp
External SaaS / API	Domain-specific third-party	Payment, geocoding, OCR, KYC, mapping, B2B/EDI gateways, etc. — anything not in the rows above.

The LLM row is included deliberately. An LLM is a backing service that demands additional rigor, not a separate architectural concept.

Each binding obeys these rules. The application declares its need; the platform provides the binding. Code receives an interface, not a connection string. Provider-specific SDKs do not leak. Every binding is wrapped by a circuit breaker (see 17), tagged for cost attribution (see 22), and has a documented degradation path when the dependency is unavailable.

Class-specific discipline

Real-time hub. Connection state lives in the hub, not in the service. Service replicas can be restarted without disconnecting clients. Fan-out by subscription, not by maintaining client lists in memory. Authentication tokens for the hub are short-lived and minted by the service per connection.
Push notifications. Idempotent at the device level (notification IDs); collapsed when the user has multiple devices; respect platform rate limits and silent-hours preferences. Failure modes (token expiry, device removal) feed back into the user-preference store, not into a retry storm.
Outbound communication. Deliverability is a backing-service concern: SPF/DKIM/DMARC for email, sender registration for SMS, suppression lists for both. Bounce handling, complaints, and unsubscribe state are persisted per recipient. Templates are versioned (see 3) and sent as data, not concatenated strings.
Inbound communication. Webhook endpoints are public-internet-facing — they sit behind the edge (see 13) with signature verification (HMAC, MIME-DKIM), strict rate limits, idempotency on the parsed identifier, and a quarantine path for malformed input. Parser pipelines are versioned and observable at every stage.
CDN / edge cache. Cache keys are explicit (path + query subset + headers); cache-control headers are designed, not accidental; purge paths are documented; origin-shield strategy chosen for high-fan-out objects.
External SaaS / API integration. Configuration-driven endpoint, credentials in the vault, circuit breaker, retry policy, rate-limit handling, cost attribution, eval coverage if the result is non-deterministic (see 15), and a documented degradation path. Non-deterministic SaaS (OCR, classification, recommendation) returns confidence scores; results are validated against expected schema before use.
LLM bindings additionally require:
- Provider-agnostic abstraction — single internal interface; provider swap is config-only.
- Models pinned by full version ID (e.g., gpt-5-2025-09-01, claude-opus-4-7-20260101) — no silent provider-side upgrades.
- Prompts as code in prompts/<task>/v<n>.md, templated, versioned, eval-tested (see 15).
- Caching at multiple layers — provider prompt caching, semantic caching for retrieval, response caching for deterministic prompts.
- Cost controls — request-level budgets and circuit breakers, service-level monthly token budgets, surfaced in FinOps dashboards (see 22).

Avoid

Tight coupling to provider-specific SDKs; treating an LLM as "just an HTTP call"; rolling your own real-time fan-out, deliverability tracking, or push-notification routing when a managed service exists; using one backing service to fake another (a stream as a work queue, a database as a cache, a cache as durable storage).

Async Messaging, Scheduled Work, and Durable Workflows

Cross-service communication prefers events on a bus or stream over synchronous RPC. Scheduled work runs as cron-style jobs. Long-running flows are held by durable execution. Outbound integrations go through signed, retried, observable webhooks. Each substrate is picked for its specific shape — they are not interchangeable.

In 2026 — five patterns; pick deliberately

1. Message broker — the default for cross-service messaging. AWS SQS / Amazon MQ; Azure Service Bus; GCP Pub/Sub (with ordering keys); RabbitMQ, NATS JetStream, ActiveMQ Artemis self-hosted — for command-style messages and work queues. Rich filtering, dead-letter queues, transactional handoff. Lower throughput than streams, ack-on-consume, server-managed offsets.

Choose a broker when: a single consumer should act on each message; the message represents a command or work unit, not a historical fact; routing, filtering, or dead-letter handling matters; transactional handoff with the producing database matters; throughput is modest (thousands per second per topic, not hundreds of thousands). This covers the majority of microservice traffic. In doubt, start with the broker.

2. Event stream — for replay, fan-out, and audit-by-default. AWS Kinesis / MSK; Azure Event Hubs (Kafka-compatible); GCP Pub/Sub Lite or Dataflow; Apache Kafka / Confluent / Redpanda / Pulsar self-hosted — for high-throughput append-only logs. Replayable, multiple consumer groups, partition-ordered.

Choose a stream when: many independent consumers need the same events at their own pace; replay from a past offset is a real requirement (event sourcing, late-joining consumers, reprocessing after a bug fix, regulatory replay); throughput exceeds what brokers handle comfortably; CDC, analytics fan-out, ML pipelines, or audit streams are the use case; partition-based ordering of related events matters.

A single business event may legitimately flow through both — e.g., to a broker for immediate consumers and to a stream for analytics and replay.

Event-sourcing-as-source-of-truth (publish to stream before writing to DB; rebuild state from the stream) is a legitimate but costly pattern. It buys auditability, time-travel, and CQRS read-model freedom; it costs operational complexity, eventual-consistency reasoning at every read, and a hard dependency on stream availability for writes. Adopt it deliberately, per bounded context, with an ADR (see 23).

3. Scheduled work — cron jobs as a first-class workload. Periodic, time-triggered work runs as scheduled jobs on the platform's job runner: AWS EventBridge Scheduler, Lambda Scheduled Events, ECS Scheduled Tasks; Azure Container Apps Jobs, Logic Apps Recurrence, Functions Timer; GCP Cloud Scheduler + Cloud Run Jobs, Workflows; Kubernetes CronJobs, Argo Workflows CronWorkflow self-hosted. Discipline:

Each job has its own container image (built and signed like a service, see 5 + 7), its own SLO (see 19), its own runbook (see 19), and its own observability surface (see 16).
Triggers and schedules live in IaC (see 4), not in cron files inside images.
Jobs are idempotent (see 9) — the platform may run them twice; consumers must dedupe.
Long jobs with checkpointable state belong in a durable workflow (below), not a cron job.
Event-triggered jobs (queue-triggered, blob-triggered) are a separate sub-class — they consume from a broker or storage event source and are scaled by the platform.

4. Durable workflows — for long-running orchestration. AWS Step Functions; Azure Durable Functions; GCP Cloud Workflows; Temporal, Cadence, Dapr Workflow, Argo Workflows self-hosted — for flows spanning seconds-to-days. State lives in the engine, not in process memory. Sagas with compensating actions handle distributed transactions.

The outbox pattern (write to DB and an outbox table in the same transaction; a separate process publishes from the outbox) is the reliable way to publish events from a transactional database — and works equally for broker, stream, and outbound webhook delivery.

Long-running agent runs are a special case of durable workflows. Loop state (steps, tool calls, decisions) is checkpointed; every run carries hard limits (max steps, max wall-clock, max tool calls, max tokens, max cost). Replay from the run record enables debugging and post-mortem.

5. Outbound webhooks — for delivering events to third-party consumers. When the consumer is outside the platform — a customer's URL, a partner system — the delivery channel is a webhook, not a broker subscription. Discipline:

Each delivery is signed (HMAC) so the receiver can verify origin.
Each delivery carries an idempotency key so duplicate retries are safe at the receiver.
Failed deliveries are retried with exponential backoff by a dedicated dispatcher service or workflow engine — not by the producing service inline.
A dead-letter queue holds permanently-failed deliveries; consumers can replay from a self-service portal.
Subscription management (URL, secret, event filter) is a backing service like any other (see 11).

Envelopes carry trace context (see 16), tenant ID (see 14), and schemaVersion (see 3) — identical envelope shape regardless of substrate. Synchronous RPC is reserved for low-latency, end-user-facing reads.

Avoid

Using streams as a generic message bus when no consumer needs replay (you pay the operational cost of offsets and consumer groups for nothing); hidden synchronous chains masquerading as async (await is not an event); cron expressions buried inside application code rather than in IaC; agent loops without hard limits — a denial-of-wallet attack waiting to happen; adopting event-sourcing-as-source-of-truth platform-wide because it sounded good in a talk; firing-and-forgetting outbound webhooks inline from a request handler.

Edge, Ingress, Gateway, and CDN Discipline

Every external request enters through a hardened, observable, policy-enforced edge. Static and cacheable content is served from a CDN at the same edge tier. Internal services never face the public internet directly.

In 2026

API Gateway terminates TLS, enforces rate limits, applies WAF rules, validates tokens, performs request shaping, emits edge telemetry.
AWS API Gateway, AppSync · Azure API Management · GCP API Gateway, Apigee · Self-hosted Kong, Tyk, KrakenD, NGINX, Envoy, Traefik, HAProxy, Emissary-ingress.
WAF blocks OWASP Top 10 patterns, abusive IPs, known bot signatures.
AWS WAF, Shield · Azure Front Door WAF, Azure WAF · GCP Cloud Armor · Self-hosted ModSecurity, Coraza, Cloudflare WAF, Fastly Next-Gen WAF.
CDN sits in front of static assets, large-object downloads, and cacheable GET responses. Cache keys, TTLs, and purge paths are designed deliberately.
AWS CloudFront · Azure Front Door CDN, Azure CDN · GCP Cloud CDN, Media CDN · Self-hosted / cloud-agnostic Cloudflare, Fastly, Akamai, BunnyCDN.
Rate limiting and quotas at multiple layers — global, per-tenant, per-API-key, per-route — using token-bucket or leaky-bucket.
mTLS or workload identity for every service-to-service call inside the cluster (see 20).
Bot management separate from WAF — bots aren't always malicious.
API versioning at the edge — /v1/* and /v2/* route to appropriate backend revisions.

Per-route policies (rate limits, JWT validation, request size caps, WAF mode, cache TTL) live as YAML alongside the service's OpenAPI and apply to the gateway on PR merge. Cacheable responses carry explicit Cache-Control and Vary headers; non-cacheable responses say so explicitly. Errors follow RFC 9457 problem-details with a traceId for correlation.

Avoid

A single bypass of the edge "for one internal use case" — that bypass becomes the next incident's root cause; rate limiting reimplemented in each service (push it to the edge; services enforce only business-logic concurrency); accidental caching of authenticated responses (always vary on the auth header or explicitly mark private).

Tenancy and Blast-Radius Isolation

Tenant boundaries are explicit at every layer. A failure or breach in one tenant does not compromise another.

In 2026

Even single-tenant systems adopt explicit tenancy from day one — retrofitting it is invasive.

Tenant identity propagates through every layer: HTTP header (X-Tenant-Id), JWT claim, message envelope field, span attribute, log attribute, query predicate, row filter.
Isolation tiers chosen per resource: shared with logical separation (row-level security in PostgreSQL, partition keys in DynamoDB / Cosmos DB, prefix-keyed Redis); pooled with isolation (per-tenant database within shared cluster); dedicated (per-tenant cluster or namespace) for high-value or regulated tenants.
Blast-radius limits — bulkheads per tenant on resource pools (DB connections, thread pools, retry budgets). One noisy tenant cannot starve others.
Per-tenant cost attribution (see 22) flows from the tenant tag on every span and resource.
Per-tenant data residency enforced at the storage layer where regulation requires (see 21).
Tenant lifecycle — onboarding, suspension, deletion — has automated, audited paths. Deletion includes vector embeddings, agent memory, message archives, and blob caches.

Database schemas include a tenant_id column with row-security enforced at the database level — application code cannot disable it. An onboarding pipeline provisions per-tenant infrastructure (IaC parameters) and populates baseline configuration.

Avoid

Tenant isolation as a code-level convention without database enforcement (one missed WHERE tenant_id = ? becomes the next data-leak incident); "we'll add tenancy later" — costs grow exponentially with existing data and code volume.

Layered Testing, Including Non-Deterministic

Every change passes through a defined testing pyramid before reaching production. Each layer has a clear gate; non-deterministic systems are tested through evals as a first-class layer.

In 2026

Unit — pure logic, no I/O. Fast, deterministic. Mutation testing (Stryker, mutmut, PIT) catches what coverage misses.
Component / integration — service against real backing services from compose (Testcontainers for ephemeral provisioning).
Contract — Pact, Spring Cloud Contract, or OpenAPI / AsyncAPI conformance — verifies producers and consumers agree on every contract (see 2). Run in both producer and consumer pipelines.
End-to-end — small set of golden-path scenarios against a QA environment. Defends against integration regression; not a substitute for lower layers.
Performance — load and soak tests with k6, Locust, JMeter, Gatling, or cloud-native load testing (AWS Distributed Load Testing, Azure Load Testing, GCP load-testing patterns); gated by latency and throughput SLOs (see 19).
Resilience and chaos — AWS Fault Injection Service, Azure Chaos Studio, GCP fault-injection patterns, Chaos Mesh / Litmus / Gremlin self-hosted. Validates Factor 17 policies under stress.
Security — SAST (CodeQL, Semgrep, SonarQube), DAST (OWASP ZAP, Burp), dependency and secret scans (see 5), IaC scans (Checkov, tfsec, KICS).
Evals (for AI-touching code paths) — deterministic test sets scored by rule-based checks and LLM-as-judge. Run on every PR that touches a prompt, tool, model pin, eval dataset, or any code path the eval covers. Online evals continuously sample production with the same scorers. Frameworks: promptfoo, DeepEval, Inspect, ragas, Azure AI Foundry evaluations, AWS Bedrock evaluations, Vertex AI evaluations.

Eval datasets are versioned with semver in tests/evals/datasets/; changes go through PR review. A CI eval job posts a comment with score deltas and links. Shadow traffic — candidate prompt or model enabled for opt-in requests with outputs compared offline before promotion. Adversarial / red-team eval sets sit alongside happy-path sets and run on every release candidate.

Avoid

Testing only the happy path (adversarial tests aren't optional); eval-as-vibes — a human eyeballing a few outputs. Codify the rubric or it doesn't exist.

Observability via OpenTelemetry

Logs, metrics, and traces flow through one OpenTelemetry pipeline, including formal semantic conventions for AI workloads.

In 2026

One SDK per runtime, OTLP exporter, single collector, primary backend in the cloud's APM with logs persisted to a queryable store.
AWS CloudWatch, X-Ray, Managed Prometheus / Grafana, OpenSearch · Azure Application Insights, Azure Monitor, Log Analytics · GCP Cloud Operations (Logging, Monitoring, Trace, Profiler) · Cloud-agnostic Datadog, New Relic, Honeycomb, Elastic, SigNoz, Grafana Cloud, self-hosted Loki + Mimir + Tempo + Pyroscope.
Trace context propagated through HTTP, MCP, A2A, broker, stream, and outbound webhook — W3C Trace Context everywhere.
GenAI semantic conventions — every LLM call emits a span with gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.id, plus prompt and completion when not redacted.
Tool-call spans with mcp.tool.name, mcp.tool.version, arguments hash, decision outcome.
Logs structured (JSON), correlated to spans by trace ID. Every line carries trace_id, service, tenant_id, env.
RED metrics per endpoint and per consumer; USE metrics per backing service; AI metrics (eval scores, model latency, token usage, cost-per-request).

Each runtime auto-instruments HTTP, DB, broker, stream, MCP server, LLM client, real-time hub, communication channels, webhook dispatcher, and agent loop. Service code emits domain events; plumbing is automatic. A repository-wide attribute schema ensures dashboards and alerts work uniformly across services. PII redaction (see 21) runs in the SDK before export. LLM-specific UX layers — Langfuse, Phoenix, Helicone — are optional supplements, not replacements.

Avoid

Logging-as-debugging — if you can't answer the question from existing traces, fix the instrumentation, don't add a log line; per-service custom attribute names — they make cross-service dashboards impossible.

Resilience by Default

Every external interaction declares timeout, retry, circuit-breaker, bulkhead, and cost policy.

In 2026

Timeouts on every outbound call — DB, HTTP, broker publish, stream produce, LLM, MCP tool, A2A, real-time hub, communication channel, third-party SaaS.
Retries with exponential backoff and jitter, only on retry-safe operations (idempotency required, see 9). Finite caps — never unbounded.
Circuit breakers trip on error rate or latency. Half-open probes test recovery.
Bulkheads isolate failure domains — a slow LLM provider must not starve the HTTP thread pool; a backed-up email queue must not delay primary writes.
Cost circuit breakers — when token spend exceeds rate or absolute threshold, the breaker degrades (cheaper model, cached response) or fails fast.
Chaos drills — periodic intentional failure injection (broker outage, DB latency spike, LLM timeout, real-time hub partition). What is not exercised is not resilient.

Library choices follow language: Polly (.NET), tenacity + httpx-with-Hyx (Python), cockatiel or NestJS interceptors (Node), Resilience4j (JVM), failsafe-go / retry-go (Go), tower (Rust). All configured from a shared resilience policy schema. A resilience.yaml per service declares per-dependency policies; middleware loads them at startup. Resilience policies for backing services in infra/ (broker retry settings, real-time hub connection backoff, gateway retry budgets) are declared alongside the resource definition.

Avoid

Default infinite retries (they amplify outages into incidents); the same retry policy for all callers (a user request and a background job have different patience budgets); retrying on a destructive non-idempotent endpoint.

Disaster Recovery and Business Continuity

Recovery time and recovery point objectives are defined per service tier, replication strategy is explicit per data class, and restores are rehearsed.

In 2026

DR is a separate discipline from per-request resilience (see 17). Resilience handles a flapping dependency; DR handles a region going dark, an accidental delete-all, or ransomware.

RTO and RPO declared per service tier (e.g., critical: RTO < 1h, RPO < 5m; standard: RTO < 4h, RPO < 1h; non-critical: best-effort, daily backup).
Multi-region for critical services — active/active or active/passive — with failover at the edge (see 13) via weighted DNS or priority routing (AWS Route 53 health-checked routing, Azure Front Door priority, GCP Cloud DNS / Cloud Load Balancing, or self-hosted equivalents).

Data replication strategy explicit per data class:

Data class	Typical replication mode	Notes
Relational DB	Async geo-replica or active-passive geo-redundant	Promotion runbook required; lag is the RPO floor. AWS RDS / Aurora Global Database; Azure geo-replication / failover groups; GCP Cloud SQL cross-region replicas / AlloyDB
Document / NoSQL	Multi-region writes (active-active) where the data model allows	Conflict resolution policy explicit. DynamoDB Global Tables, Cosmos DB multi-region writes, Firestore multi-region, Cassandra multi-DC
Cache	Active-active geo-replication for low-latency reads, or rebuild on failover	Treat cache contents as recomputable; not the source of truth
Object storage	Cross-region replication for read access during regional outage	S3 CRR; Azure GRS / RA-GRS; GCS dual-region / multi-region; lifecycle policies replicate across regions
Message broker	Geo-DR pairing — primary-secondary alias, manual or scripted failover	In-flight messages are NOT replicated; consumers must be idempotent. SQS cross-region forwarding patterns; Service Bus geo-DR; Pub/Sub multi-region by default
Event stream	Sync geo-DR or mirror-maker — zero or near-zero data loss	Stream is the audit log; replication is non-negotiable. MSK Replicator; Event Hubs geo-DR; Pub/Sub Lite cross-region; Kafka MirrorMaker 2 / Confluent Cluster Linking
Real-time hub	Regional with client reconnection on failover	Connection state is ephemeral by design
Vector store / Search index	Rebuild from source-of-truth data store	Index in-region; reindex on regional recovery

Backups with frequency, retention, immutability (against ransomware), and restore tested on schedule. An untested backup is a hopeful guess. Cloud-native immutable storage: AWS S3 Object Lock, Azure Immutable Blob Storage, GCP Bucket Lock; self-hosted via WORM-mode object stores.
Game days — quarterly exercises that fail over a region, restore from backup, or simulate a key compromise. The runbook (see 19) is rehearsed under realistic time pressure.
Configuration and IaC backups — recovery target depends on Terraform / Pulumi / Bicep / CloudFormation state and pipeline definitions; both version-controlled and replicated.
AI-specific DR — vector index rebuilds, prompt registry replication, model fallback chains documented for provider outages.

Each service declares its tier in slos/<service>.yaml with enforceable RTO/RPO targets. Critical-tier services run in two regions with active/passive failover. Promotion runbook in runbooks/dr/<service>.md, exercised quarterly. Backup restore drills run automatically against a non-prod environment and produce a pass/fail signal for the SLO dashboard.

Avoid

Treating "the cloud" as inherently durable; untested backups — until a restore has succeeded, the backup is a hypothesis; replicating cache contents instead of regenerating them; assuming broker geo-DR replicates in-flight messages (it doesn't).

SLOs, Error Budgets, and Runbooks-as-Code

What "working" means is defined in code, measured continuously, and connected to operational decisions.

In 2026

SLIs (latency, error rate, eval scores, cost-per-request, freshness) defined in slos/<service>.yaml.
SLOs with explicit error budgets; budget burn alerts feed an on-call rotation.
Burn-rate alerts at multiple windows (1h fast burn, 6h sustained burn) preempt full budget exhaustion.
Runbooks in runbooks/<symptom>.md, linked from every alert. Runbooks are code: linted, reviewed, exercised in chaos days (see 17) and DR drills (see 18).
Synthetic probes continuously exercise critical user journeys (CloudWatch Synthetics, Application Insights Availability tests, Cloud Monitoring Uptime checks, or self-hosted Checkly / Blackbox Exporter).
Postmortems blameless, templated, committed to the repository, and feed back into SLOs and runbooks. Every incident yields at least one durable change.
Eval scores for AI-touching services are SLIs alongside latency and error rate.

Repository has top-level slos/, runbooks/, and postmortems/. CI lints alerts to ensure each has a linked runbook. SLO definitions apply to the observability backend as code (Sloth, OpenSLO, Datadog SLO IaC, Azure Monitor SLO IaC, GCP Service Monitoring SLO IaC). The on-call schedule itself is defined as code (PagerDuty / Opsgenie / Grafana OnCall configuration in oncall.yaml).

Avoid

SLOs nobody reads — if the dashboard isn't part of the regular operating cadence, the SLO is aspirational; heroic recovery without postmortems — the next incident has the same root cause and a different victim.

Zero-Trust Identity and Authorization

No network is trusted by default. Every request, from any source, carries an identity, and authorization is enforced at every layer.

In 2026

Service-to-service — managed workload identities plus mTLS. AWS IAM Roles for Service Accounts / IAM Roles Anywhere; Azure Managed Identity / Workload Identity; GCP Workload Identity Federation. Service mesh: Linkerd, Istio, Cilium, Consul Connect, AWS App Mesh, Azure Service Mesh.
Human-to-service — OAuth 2.1 / OpenID Connect via the identity provider; PKCE for public clients; refresh-token rotation; short-lived access tokens.
AWS Cognito, IAM Identity Center · Azure Microsoft Entra ID, Entra External ID · GCP Cloud Identity, Identity Platform, Firebase Auth · Cloud-agnostic Auth0, Okta, Keycloak, Authentik, Zitadel, FusionAuth.
Agent-to-service — agents authenticate as workload identities or scoped service principals; MCP tools and A2A endpoints are RBAC-gated per identity.
Tokens minimal-scope, short-lived. Long-lived secrets are a last resort and live in the vault.
Authorization at every layer — gateway (coarse), service (route), tool (capability), message handler (event type), database (row-level security).
Audit every authentication and every authorization decision (see 21).

Every HTTP, MCP, and A2A endpoint requires authentication — no anonymous routes outside /health/*. Shared auth/ middleware parses tokens and exposes a principal (user / service / agent + roles + tenant + clearance). Tool-level RBAC is declared in mcp.yaml and enforced before invocation. Destructive tools (any side effect that cannot be reversed by a subsequent call) require either elevated scope or a human-in-loop checkpoint. CSPM rules (AWS Config, Azure Defender for Cloud, GCP Security Command Center, or self-hosted Cloud Custodian / Steampipe) block deployment of any resource without a managed identity.

Avoid

"Internal" endpoints that skip auth because "they're behind the firewall" — zero-trust means no inside; long-lived API keys — they become the next credential leak.

Privacy, Data Classification, and Audit

Every data field has a classification. Collection is minimized, location is known, retention is bounded, sensitive data is redacted before crossing into prompts and observability, and every state-changing action is auditable.

In 2026

Classification taxonomy (Public / Internal / Confidential / Restricted / Personal / Sensitive Personal) applied to every field in every schema (OpenAPI, AsyncAPI, database).
Data residency enforced at the storage layer where regulation requires (GDPR, India DPDP, CCPA/CPRA, LGPD, equivalents). Cloud-native residency primitives: AWS Region/AZ controls + AWS Control Tower; Azure regional resource groups + Confidential Computing; GCP regional resources + Assured Workloads.
Minimization — collect only what's needed; drop the rest at the boundary.
Retention — TTLs on caches, logs, traces, agent memory, message archives. Each retention period is justified.
Redaction in observability — PII detected and masked before traces and logs leave the process. Tooling: AWS Macie, Azure Purview, GCP DLP / Cloud DLP, self-hosted Presidio, custom OTel processors.
Prompt hygiene — Restricted and Sensitive Personal data filtered out of LLM prompts unless the provider is contractually data-protected (e.g., Bedrock, Azure OpenAI, Vertex AI with no-training agreements).
Right-to-be-forgotten — every store has a documented and tested delete path; agent memory and vector embeddings included.
Audit trail — every state-changing action carries actor identity, target, before/after state hash, timestamp, trace ID. Audit storage is immutable (append-only object storage with object-lock, append-only audit tables) for at least the regulatory retention period.
Auditable AI actions — every agent-initiated state change records prompt, model, tool, decision chain, approver.

A repository-wide data-classification.yaml lists every field-name pattern and its classification. CI fails if a new field lacks one. A shared privacy/ middleware redacts classified fields from logs and traces. Vector stores tag every embedding with source classification; retrieval filters honor the caller's clearance. The audit log is a separate event stream piped to immutable object storage; no service writes to audit storage directly.

Avoid

Free-text fields that quietly become PII landfills (classify by the highest-classification content they might hold); audit logs in the same store as application data — co-located audit logs are tampered audit logs in the wrong incident.

FinOps as a First-Class Property

Compute, storage, network, and AI token costs are attributable per service, per tenant, per request — and visible to engineers in their normal workflow.

In 2026

Cost tags on every cloud resource (service, env, team, cost-center, tier). Enforced via AWS Tag Policies, Azure Policy, GCP Organization Policy, or Crossplane policy.
Per-request cost computed and emitted as OTel attributes (cost.tokens.prompt, cost.tokens.completion, cost.compute.ms, cost.estimated_usd).
Per-tenant attribution for multi-tenant services (tenant tag on every span and resource).
Budgets at multiple layers — subscription / account budget alerts, service-level monthly token budgets, request-level cost circuit breakers (see 17).
Cost in the PR — CI emits an estimated delta in monthly cost based on dependency, image size, infrastructure (Infracost, AWS Cost Explorer Diff, Azure Cost Management, GCP Cost Estimator, OpenCost), and LLM-call changes. Reviewers see it before merge.
Unit economics answerable at the request level. "How much does this one user journey cost?" is a question the system answers.

A nightly job rolls cost up by service, endpoint, and tenant; results appear in a service catalog tab and observability workbooks. The shared LLM client blocks calls that would exceed the per-request budget unless explicitly elevated. Quarterly cost reviews are part of the engineering cadence — owned by service teams, not finance.

Avoid

Aggregate-only dashboards (the unit-economics question requires per-request granularity); treating cost as somebody else's problem (the team that ships the code owns its operating cost).

Documentation, Decisions, and Machine-Readable Seams

The repository is structured for two readers — humans and software agents — and they're now the same audience. Decisions are captured where future readers will look.

In 2026

AI agents (Claude Code, Copilot, Cursor, Aider, Cline) are permanent collaborators. Repository structure, naming, and documentation are architectural choices that determine how effectively those collaborators — and humans — can work.

Agent-readable instructions at repository root — CLAUDE.md, .cursorrules, .github/copilot-instructions.md, AGENTS.md, .aider.conf.yml — generated from a single source explaining architecture, conventions, gotchas, and how to run things. Per-service equivalents cover service-specific concerns.
ADRs in docs/adr/NNNN-<slug>.md (MADR format) capture every architectural decision with date, status, context, options considered, decision, and consequences. Reversing an ADR requires a new ADR that supersedes it.
Small, focused files — long files defeat both human review and agent context windows.
Machine-readable seams — OpenAPI, MCP manifests, AsyncAPI, JSON Schema, type definitions, classification files, SLO files, runbook indexes — agents and humans inspect and reason about the same artifacts.
Semantic naming — OrderRepository over OrderHelper; cancelOrder() over processOrder().
Examples in tests function as documentation that cannot lie.
Architecture diagrams as code (Mermaid, Structurizr DSL, PlantUML, D2) live in the repository alongside their subjects, not in slide decks.
Onboarding guide — docs/onboarding.md — that an external collaborator (human or agent) can follow to a working dev loop without external help.

A weekly job summarizes recent ADRs and posts to the team channel — decisions don't get lost in a folder.

Avoid

Documentation that becomes a parallel universe — beautiful, ignored, wrong (tie docs to executable artifacts so drift is detectable in CI); ADRs as compliance theater — written after the fact to justify a decision already made.

Appendix A — Glossary

A2A	Agent-to-Agent — protocol for inter-agent communication including capability discovery and trust handshake.
ADR	Architecture Decision Record (MADR format).
APM	Application Performance Monitoring.
AsyncAPI	OpenAPI-equivalent specification for event-driven and message-based APIs.
CDC	Change Data Capture — streaming database changes as events.
CDN	Content Delivery Network.
CSPM	Cloud Security Posture Management.
Dapr	Distributed Application Runtime — sidecar primitives for service mesh, state, pubsub, secrets.
DPDP	India's Digital Personal Data Protection Act, 2023.
DR	Disaster Recovery.
FinOps	Discipline of managing variable cloud and AI spend as an engineering concern.
GenAI Conventions	OpenTelemetry semantic conventions for generative-AI telemetry.
IaC	Infrastructure as Code.
KEDA	Kubernetes Event-Driven Autoscaling.
LLM-as-judge	Pattern where an LLM scores another LLM's output against a rubric.
MADR	Markdown Any Decision Records — common ADR format.
MCP	Model Context Protocol — standard for exposing tools, prompts, and resources to LLM agents.
mTLS	Mutual TLS — both sides authenticate by certificate.
OIDC	OpenID Connect.
OPA	Open Policy Agent (with the Rego policy language).
OTel / OTLP	OpenTelemetry / OpenTelemetry Protocol.
PII	Personally Identifiable Information.
RAG	Retrieval-Augmented Generation.
RBAC	Role-Based Access Control.
RTO / RPO	Recovery Time Objective / Recovery Point Objective.
SBOM	Software Bill of Materials.
Sigstore / Notation	Container image and artifact signing systems.
SLI / SLO	Service Level Indicator / Service Level Objective.
WAF	Web Application Firewall.

Appendix B — Heritage

This document derives from two predecessors.

12-Factor App (Wiggins, 2012) → 23 Factors

Original factor	23 Factors
I. Codebase	1
II. Dependencies	5
III. Config	4
IV. Backing services	11
V. Build, release, run	7
VI. Processes	9
VII. Port binding	10
VIII. Concurrency	9
IX. Disposability	9
X. Dev/prod parity	6
XI. Logs	16
XII. Admin processes	7, 12, 19 (distributed)

Beyond the Twelve-Factor App (Hoffman, 2016) → 23 Factors

Hoffman factor	23 Factors
1. One Codebase	1
2. API First	2
3. Dependency Management	5
4. Design, Build, Release, Run	7 (with design distributed across 2, 5, 23)
5. Configuration, Credentials, Code	4
6. Logs	16
7. Disposability	9
8. Backing Services	11
9. Environment Parity	6
10. Administrative Processes	7, 12, 19
11. Port Binding	10
12. Stateless Processes	9
13. Concurrency	9
14. Telemetry	16
15. Authentication and Authorization	20

What 23 Factors adds

Beyond the inherited factors, this manifesto introduces explicit disciplines for: contract-first multi-audience design including agents (2), versioned backwards-compatible evolution across HTTP, MCP, events, streams, and database (3), provenance and supply-chain integrity (5), progressive feature-flagged delivery with preview environments (8), self-bound ports for HTTP / MCP / A2A audiences (10), expanded backing-services taxonomy covering real-time hubs, push, communications, CDN, LLM, and external SaaS (11), broker / stream / scheduled-job / durable-workflow / outbound-webhook discipline (12), edge / ingress / gateway / CDN as one tier (13), tenancy and blast-radius isolation (14), layered testing with evals as a first-class layer (15), OpenTelemetry for AI workloads (16), resilience by default with cost circuit breakers (17), DR with explicit replication strategy per data class (18), SLOs and runbooks-as-code (19), zero-trust identity for humans, services, and agents (20), privacy / classification / audit (21), FinOps as engineering discipline (22), and machine-readable seams for human and agent collaborators (23).