Irth Solutions
2026 Edition

23 Factors

Cloud-Native & AI Aware

The 23 disciplines a service must obey to be portable, observable, recoverable, secure, cost-aware, and continuously deployable in a modern cloud-native, AI-aware system.

A standing contract between every service and the platform that hosts it.
Successor to The Twelve-Factor App (Wiggins, 2012) and Beyond the Twelve-Factor App (Hoffman, 2016).

About this manifesto

What this is

A standing contract between every service and the platform that hosts it. Each factor is a discipline. A service that obeys these factors is portable, observable, recoverable, secure, cost-aware, and continuously deployable.

What this is not

A roadmap, a methodology, a feature checklist, or a substitute for product or domain design. It does not prescribe the size of a service, its language, its bounded context, or its team. It describes the properties any service must hold to be production-ready.

Heritage

These factors descend from the original Twelve-Factor App (Wiggins, 2012) and Kevin Hoffman's Beyond the Twelve-Factor App (2016). They expand that lineage to address the architecture, security, observability, AI-native, and FinOps realities of 2026. Predecessor mappings are in Appendix B.

Applicability

The principles apply uniformly to a single microservice, a domain service, or a platform-of-platforms. The discipline is universal; the investment is tier-dependent. A small service can declare modest targets (e.g., "99.5% available, RTO 4h") and be fully compliant. A critical platform service might require 99.99% and RTO 5m. Follow each factor's principle; size the implementation to the service's tier. Where the recommendations describe more rigor than a small or experimental service warrants, treat them as the upper-bound reference — adopt the principle, then choose a proportionate implementation.

The 2026 landscape this answers to

How to read this

Each factor follows the same shape: Principle (the rule), In 2026 (current tools and patterns), and an Avoid callout. Cross-references use (see N). Examples of named services are given in four flavours wherever possible: AWS, Azure, GCP, and a self-hosted / cloud-agnostic equivalent. Glossary in Appendix A; heritage in Appendix B.

The 23 Factors at a Glance

Each factor with its one-line rule:

1. Polyglot Mono-Repo, Symmetric Services: One repository, many runtimes, one service shape.
2. Contract-First, Multi-Audience: OpenAPI for services, MCP for agents, AsyncAPI for events — all versioned, all in repo before code.
3. Versioned, Backwards-Compatible Evolution: Contracts and data evolve; old consumers keep working.
4. Externalized Configuration, Secrets, Infrastructure, and Policy: Nothing inline; everything declarative and version-controlled.
5. Provenance-Tracked Dependencies: Lockfiles, SBOMs, signed images, vulnerability and license scans — every byte explainable.
6. Dev = CI = Prod: Devcontainers and identical backing services across every environment.
7. Build Once, Sign Once, Deploy Many: One immutable artifact promoted across environments; rollback is a digest swap.
8. Progressive, Feature-Flagged Delivery: Code reaches production well before users; previews exist for every PR; rollout is independent of deployment.
9. Stateless, Disposable, Idempotent, Horizontal: Processes hold no state, start and stop fast, are safe to retry, scale by replication.
10. Self-Bound Ports for Every Audience: Each service binds its own ports for HTTP, MCP, and A2A.
11. Backing Services as Bound Resources: Every external dependency is configuration-bound and swappable.
12. Async Messaging, Scheduled Work, and Durable Workflows: Broker by default; streams for replay/fan-out; jobs for cron; durable execution for long flows; signed webhooks for outbound.
13. Edge, Ingress, Gateway, and CDN Discipline: Every external request enters through a hardened, observable, policy-enforced edge.
14. Tenancy and Blast-Radius Isolation: Tenant boundaries are explicit at every layer; failures are contained.
15. Layered Testing, Including Non-Deterministic: Unit, integration, contract, end-to-end, performance, security, evals — each layer has a defined gate.
16. Observability via OpenTelemetry: One pipeline for logs, metrics, traces, and GenAI signals.
17. Resilience by Default: Every outbound interaction declares timeout, retry, circuit-breaker, bulkhead, and cost policy.
18. Disaster Recovery and Business Continuity: RTO and RPO defined per tier, replication explicit per data class, restores rehearsed.
19. SLOs, Error Budgets, and Runbooks-as-Code: Define what "working" means; measure it; respond to it.
20. Zero-Trust Identity and Authorization: No trusted network; every request authenticated and authorized at every layer.
21. Privacy, Data Classification, and Audit: Classify data, minimize collection, bound retention, redact at telemetry, audit immutably.
22. FinOps as a First-Class Property: Compute, storage, network, and AI costs are attributed per service, per tenant, per request.
23. Documentation, Decisions, and Machine-Readable Seams: Repository organized for both humans and software agents.
01

Polyglot Mono-Repo, Symmetric Services

One repository holds many services across many runtimes; every service follows the same shape regardless of language.

In 2026

Polyglot mono-repos are the dominant pattern at scale (Nx, Turborepo, Bazel, Pants, or folder conventions). Language is a runtime detail; service shape is a contract — health endpoints, log format, observability instrumentation, container layout, and security middleware are identical across runtimes. Conventional commits and shared linting apply repository-wide. Per-runtime gates (formatters, type checkers, linters) run alongside repository-wide gates (commit-message lint, markdown lint, YAML lint, GitHub Actions / pipeline lint). One task entry-point per operation (bootstrap, up, test, lint) via Taskfile, just, or make so newcomers don't have to learn each runtime's idioms.

Avoid

Wire-format drift between languages (PascalCase vs. camelCase, ISO-8601 vs. epoch ms — normalize in the contract, not by convention); copy-pasted "service templates" that diverge over time (regenerate, don't fork).
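The normalize-in-the-contract rule can be made concrete with a small sketch. This is an illustrative serializer, not part of the manifesto: it pins the wire convention (camelCase keys, ISO-8601 UTC timestamps) at the boundary so each runtime's internal naming style never leaks onto the wire.

```python
from datetime import datetime, timezone

def to_camel(name: str) -> str:
    """snake_case -> camelCase, the wire convention assumed here."""
    head, *rest = name.split("_")
    return head + "".join(word.capitalize() for word in rest)

def to_wire(payload: dict) -> dict:
    """Normalize a payload at the contract boundary: camelCase keys,
    ISO-8601 UTC timestamps (never epoch milliseconds)."""
    out = {}
    for key, value in payload.items():
        if isinstance(value, datetime):
            value = value.astimezone(timezone.utc).isoformat()
        elif isinstance(value, dict):
            value = to_wire(value)
        out[to_camel(key)] = value
    return out
```

Every runtime calls the equivalent of `to_wire` in generated serialization code, so the convention is enforced by the contract tooling rather than by per-team habit.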

02

Contract-First, Multi-Audience

Every service publishes machine-readable contracts for its three audiences — humans, other services, and software agents — before implementation.

In 2026

Three contract surfaces, all in source control:

Mocks, server stubs, and client SDKs are generated from specs. CI fails if code drifts from spec. A repository-wide aggregator (Backstage, Port, or a static catalog site) publishes the contracts for human and agent discovery.

Avoid

Auto-generating MCP tools 1:1 from REST endpoints (agents need capabilities, not CRUD verbs); free-text "options" blobs in tool inputs; treating one audience as primary and the others as afterthoughts.

03

Versioned, Backwards-Compatible Evolution

Contracts and data evolve; consumers don't break. Old versions keep working until consumers have demonstrably migrated.

In 2026

A nightly compatibility-check job replays sampled production traffic against the candidate build to catch regressions before they reach customers.
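One simplified check such a job can run — an assumption-laden sketch, not the full traffic-replay harness — is the tolerant-reader rule: every field an old consumer requires must still be present in the candidate's responses, while new fields are ignored.

```python
def backwards_compatible(old_required: set, candidate_payload: dict) -> bool:
    """A candidate stays compatible when every field the old consumer
    requires is still present; extra new fields are ignored."""
    return old_required <= candidate_payload.keys()

def replay_check(samples: list, old_required: set) -> list:
    """Flag sampled candidate responses that drop a required field."""
    return [s for s in samples if not old_required <= s.keys()]
```

A real harness replays recorded requests against the candidate build and diffs responses field by field; the subset check above is the minimal invariant it must enforce.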

Avoid

"We'll bump everyone at once" (collapses when external agents or third-party consumers are in the mix); mixing schema and behaviour changes in one migration (separate them so each can be rolled back).

04

Externalized Configuration, Secrets, Infrastructure, and Policy

Everything that varies between environments — configuration, secrets, infrastructure topology, policy rules — lives outside the image, is declarative, and is version-controlled.

In 2026

Litmus test: if this entire repository were pushed to a public mirror tomorrow, what would leak? Anything beyond "nothing" indicates incomplete externalization. Drift detection (terraform plan, bicep what-if, pulumi preview) runs in CI on every infra change; out-of-band manual changes in production trigger an alert.
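A minimal sketch of what externalization looks like in application code, with hypothetical variable names: everything environment-specific comes from the environment (or a mounted binding), and missing values fail at startup rather than at first use.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    database_url: str
    feature_flags_url: str
    log_level: str

def load_config(env=os.environ) -> Config:
    """Nothing baked into the image: every environment-specific value
    arrives via the environment; absence is a startup failure."""
    def require(name: str) -> str:
        value = env.get(name)
        if not value:
            raise RuntimeError(f"missing required configuration: {name}")
        return value
    return Config(
        database_url=require("DATABASE_URL"),
        feature_flags_url=require("FEATURE_FLAGS_URL"),
        log_level=env.get("LOG_LEVEL", "info"),  # non-secret default is fine
    )
```

Note the litmus test holds: the code above can be public; only the environment it runs in carries anything sensitive.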

Avoid

Secrets stored in the config service (configuration is for non-sensitive values; only the vault holds credentials); "just this one config file in the image" — a single exception destroys the immutability story (see 7).

05

Provenance-Tracked Dependencies

Every byte that ships to production is explainable, scanned, signed, and locked.

In 2026

The artifact promoted from staging to production is the same digest — no rebuilds across environments. Production admission policy refuses unsigned images.

Avoid

Transitive dependencies that bypass scans (private packages, unpinned base layers, install-time downloads); excepting "internal-only" dependencies from signing — internal is where the next supply-chain attack will originate.

06

Dev = CI = Prod

The dev loop on a laptop, the build in CI, and the runtime in production share dependencies, tooling, and behaviour.

In 2026

Avoid

In-memory test doubles for backing services (H2 instead of Postgres, SQLite instead of MySQL — they lie about behaviour at exactly the wrong moments); environmental drift hidden in locale, timezone, or case-sensitivity.

07

Build Once, Sign Once, Deploy Many

A single immutable artifact (image digest) is promoted across every environment. Configuration — not code — differs.

In 2026

Avoid

Mutable tags (:latest, :main) — they turn rollbacks into archaeology and admission policies into theatre; "for development only" config baked into images — there is no "for development only."

08

Progressive, Feature-Flagged Delivery

Releases and rollouts are independent. Code reaches production well before users; every PR is reviewable on a real environment.

In 2026

Avoid

Flag debt — alert at 30/60/90 days; rolling out a flag and a code change in the same deploy (the whole point of flags is to decouple them); long-lived preview environments — they drift from main and become their own incident surface.

09

Stateless, Disposable, Idempotent, Horizontal

Processes hold no long-lived state, start and stop quickly, are safe to retry, and scale by replication.

In 2026

Four facets of one architectural commitment — a service is a fungible replica.

Real-time exception. A WebSocket / SignalR / SSE service holds connection state by definition. The discipline still applies — connection state is held in a backing service (see 11: Real-time hub) and the service process itself remains fungible. Any one replica can serve any one connection because the hub manages routing.

Shared idempotency middleware stores (idempotency-key, response-hash, expires-at) in a fast key-value store (e.g., Redis). Message handlers persist (message_id, processed_at, result_hash) before committing side effects (outbox pattern, see 12). Readiness flips to "not ready" before SIGTERM completes so the platform drains traffic.
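A sketch of that idempotency middleware, assuming a plain dict stands in for the fast key-value store (Redis in production): a replayed key returns the stored response without re-executing the side effect.

```python
import hashlib
import json
import time

class IdempotencyStore:
    """Maps idempotency-key -> (response-hash, response, expires-at);
    a dict here stands in for Redis."""
    def __init__(self, ttl_seconds: int = 3600, clock=time.time):
        self._store = {}
        self._ttl = ttl_seconds
        self._clock = clock

    def execute(self, key: str, handler):
        entry = self._store.get(key)
        if entry and entry["expires_at"] > self._clock():
            return entry["response"]  # replay: no second side effect
        response = handler()
        self._store[key] = {
            "response_hash": hashlib.sha256(
                json.dumps(response, sort_keys=True).encode()).hexdigest(),
            "response": response,
            "expires_at": self._clock() + self._ttl,
        }
        return response
```

The stored response hash lets the middleware detect a caller reusing a key with a different request body, which should be rejected rather than replayed.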

Avoid

"Just one tiny in-memory counter" (where horizontal scaling dies); in-process caches that take minutes to warm; distributed locks as a casual coordination primitive; rolling your own WebSocket fan-out instead of using a managed real-time hub.

10

Self-Bound Ports for Every Audience

Each service binds its own listening ports for every audience it serves. The platform routes; the service serves.

In 2026

One service, multiple listeners:

The platform handles ingress, mTLS, and routing. The service is self-contained — no IIS, Tomcat, or external app server. Each runtime's entry point binds three configurable ports — PORT_HTTP, PORT_MCP, PORT_A2A — with stable defaults across local and cloud. A2A endpoints sit behind the same authentication layer (see 20) as MCP and HTTP.

Avoid

The MCP surface drifting from the HTTP surface (same business capability, different framing — not different capabilities); hosting multiple services in a single container.

11

Backing Services as Bound Resources

Every external dependency is attached at runtime via configuration and is swappable without redeploy.

In 2026

Backing services span a far wider class than the original 12-factor "DB + cache + broker." A service that omits any relevant class below is implicitly putting that responsibility inside the application — almost always the wrong choice. Each row gives an AWS / Azure / GCP / self-hosted option.

Class | Purpose | AWS | Azure | GCP | Self-hosted / cloud-agnostic
Relational DB | Transactional records | RDS, Aurora | Azure Database for PostgreSQL/MySQL/SQL | Cloud SQL, AlloyDB | PostgreSQL, MySQL, MariaDB, CockroachDB
Document / KV | Schema-flexible records | DynamoDB, DocumentDB | Cosmos DB | Firestore, Bigtable | MongoDB, Cassandra, ScyllaDB, Couchbase
Cache | Hot-path, session, rate-limit state | ElastiCache, MemoryDB | Azure Cache for Redis | Memorystore | Redis, KeyDB, Dragonfly, Memcached, Hazelcast
Search index | Full-text, faceted, hybrid | OpenSearch Service, Kendra | Azure AI Search | Vertex AI Search | Elasticsearch, OpenSearch, Meilisearch, Typesense, Algolia
Vector store | Embeddings, semantic retrieval | OpenSearch k-NN, Bedrock KB, Aurora pgvector | AI Search vectors, Cosmos DB vector | Vertex AI Vector Search, AlloyDB pgvector | pgvector, Qdrant, Weaviate, Milvus, Pinecone, Chroma
Object storage | Blobs, files, media, archive | S3 (Standard / IA / Glacier) | Blob Storage (Hot / Cool / Archive) | Cloud Storage (Standard / Nearline / Coldline / Archive) | MinIO, Ceph, SeaweedFS, Garage
Message broker (see 12) | Commands, work queues | SQS, Amazon MQ | Service Bus | Pub/Sub (with ordering keys) | RabbitMQ, NATS JetStream, ActiveMQ Artemis
Event stream (see 12) | Replay, CDC, fan-out, audit | Kinesis Data Streams, MSK | Event Hubs (Kafka-compatible) | Pub/Sub Lite, Dataflow | Apache Kafka, Confluent, Redpanda, Apache Pulsar
Workflow engine | Long-running orchestration | Step Functions, SWF | Durable Functions, Logic Apps | Cloud Workflows, Cloud Composer | Temporal, Cadence, Dapr Workflow, Argo Workflows, Conductor
Schema registry | Runtime contract enforcement | Glue Schema Registry | Azure Schema Registry | (application-layer; Confluent on GCP) | Confluent Schema Registry, Apicurio, Karapace
Real-time hub | Persistent connections, presence, fan-out | AppSync subscriptions, IoT Core, API Gateway WS | Azure SignalR Service, Web PubSub | Firebase Realtime Database, Firestore listeners | Centrifugo, Soketi, Pusher, Ably
Push notifications | Device delivery when app is closed | SNS Mobile Push, Pinpoint | Notification Hubs (APNS/FCM/WNS) | Firebase Cloud Messaging | OneSignal, Gotify, ntfy
Outbound communication | Email, SMS, voice, WhatsApp | SES, SNS, Pinpoint, Connect | Communication Services | (third-party; or partner add-ons) | SendGrid, Mailgun, Postmark, Twilio, MessageBird, Plivo
Inbound communication | Email-to-event, SMS receive, IVR | SES Receiving, Pinpoint two-way | Communication Services Email Receiving | (via partner) | SendGrid Inbound Parse, Twilio inbound, Mailgun Routes, signed-webhook receivers
CDN / edge cache (see 13) | Static assets, cacheable GETs | CloudFront | Front Door CDN, Azure CDN | Cloud CDN, Media CDN | Cloudflare, Fastly, Akamai, BunnyCDN, KeyCDN
Identity provider | Authentication, federation, SSO, B2C | Cognito, IAM Identity Center | Microsoft Entra ID, Entra External ID | Cloud Identity, Identity Platform, Firebase Auth | Keycloak, Authentik, Zitadel, Auth0, Okta, FusionAuth
Secrets manager | Credentials, keys, certificates | Secrets Manager, Parameter Store | Key Vault | Secret Manager | HashiCorp Vault, Bitwarden Secrets, Infisical, OpenBao
Configuration service | Non-secret values, feature flags | AppConfig | App Configuration | Runtime Config, Firebase Remote Config | Consul, etcd, Spring Cloud Config, Unleash, Flagsmith
Observability backend (see 16) | Trace, metric, log destination | CloudWatch, X-Ray, Managed Prometheus / Grafana | Application Insights, Azure Monitor | Cloud Operations (Logging / Monitoring / Trace) | Grafana stack (Loki/Mimir/Tempo), Prometheus, Datadog, New Relic, Honeycomb, Elastic, SigNoz
LLM provider | Generation, embeddings, evaluation | Bedrock, SageMaker | Azure OpenAI, AI Foundry | Vertex AI | Anthropic API, OpenAI API, Mistral API, Cohere; self-hosted Ollama, vLLM, TGI, llama.cpp
External SaaS / API | Domain-specific third-party | Payment, geocoding, OCR, KYC, mapping, B2B/EDI gateways, etc. — anything not in the rows above.

The LLM row is included deliberately. An LLM is a backing service that demands additional rigor, not a separate architectural concept.

Each binding obeys these rules. The application declares its need; the platform provides the binding. Code receives an interface, not a connection string. Provider-specific SDKs do not leak. Every binding is wrapped by a circuit breaker (see 17), tagged for cost attribution (see 22), and has a documented degradation path when the dependency is unavailable.
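As a minimal sketch of "code receives an interface, not a connection string" — with a hypothetical cache binding and provider names — the application depends on an abstract class and a factory resolves the provider from configuration:

```python
from abc import ABC, abstractmethod

class Cache(ABC):
    """Application code depends on this interface; the provider SDK
    never leaks into application modules."""
    @abstractmethod
    def get(self, key: str): ...
    @abstractmethod
    def set(self, key: str, value) -> None: ...

class InMemoryCache(Cache):
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value):
        self._data[key] = value

def bind_cache(config: dict) -> Cache:
    """The platform provides the binding; swapping providers is a
    configuration change. A real registry would map 'redis',
    'memcached', etc. to adapter classes wrapping their SDKs."""
    provider = config.get("CACHE_PROVIDER", "memory")
    if provider == "memory":
        return InMemoryCache()
    raise RuntimeError(f"no adapter registered for {provider!r}")
```

The same shape applies to every row of the table: one interface per class, one adapter per provider, selection by configuration.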

Class-specific discipline

Avoid

Tight coupling to provider-specific SDKs; treating an LLM as "just an HTTP call"; rolling your own real-time fan-out, deliverability tracking, or push-notification routing when a managed service exists; using one backing service to fake another (a stream as a work queue, a database as a cache, a cache as durable storage).

12

Async Messaging, Scheduled Work, and Durable Workflows

Cross-service communication prefers events on a bus or stream over synchronous RPC. Scheduled work runs as cron-style jobs. Long-running flows are held by durable execution. Outbound integrations go through signed, retried, observable webhooks. Each substrate is picked for its specific shape — they are not interchangeable.

In 2026 — five patterns; pick deliberately

1. Message broker — the default for cross-service messaging. AWS SQS / Amazon MQ; Azure Service Bus; GCP Pub/Sub (with ordering keys); RabbitMQ, NATS JetStream, ActiveMQ Artemis self-hosted — for command-style messages and work queues. Rich filtering, dead-letter queues, transactional handoff. Lower throughput than streams, ack-on-consume, server-managed offsets.

Choose a broker when: a single consumer should act on each message; the message represents a command or work unit, not a historical fact; routing, filtering, or dead-letter handling matters; transactional handoff with the producing database matters; throughput is modest (thousands per second per topic, not hundreds of thousands). This covers the majority of microservice traffic. When in doubt, start with the broker.

2. Event stream — for replay, fan-out, and audit-by-default. AWS Kinesis / MSK; Azure Event Hubs (Kafka-compatible); GCP Pub/Sub Lite or Dataflow; Apache Kafka / Confluent / Redpanda / Pulsar self-hosted — for high-throughput append-only logs. Replayable, multiple consumer groups, partition-ordered.

Choose a stream when: many independent consumers need the same events at their own pace; replay from a past offset is a real requirement (event sourcing, late-joining consumers, reprocessing after a bug fix, regulatory replay); throughput exceeds what brokers handle comfortably; CDC, analytics fan-out, ML pipelines, or audit streams are the use case; partition-based ordering of related events matters.

A single business event may legitimately flow through both — e.g., to a broker for immediate consumers and to a stream for analytics and replay.

Event-sourcing-as-source-of-truth (publish to stream before writing to DB; rebuild state from the stream) is a legitimate but costly pattern. It buys auditability, time-travel, and CQRS read-model freedom; it costs operational complexity, eventual-consistency reasoning at every read, and a hard dependency on stream availability for writes. Adopt it deliberately, per bounded context, with an ADR (see 23).

3. Scheduled work — cron jobs as a first-class workload. Periodic, time-triggered work runs as scheduled jobs on the platform's job runner: AWS EventBridge Scheduler, Lambda Scheduled Events, ECS Scheduled Tasks; Azure Container Apps Jobs, Logic Apps Recurrence, Functions Timer; GCP Cloud Scheduler + Cloud Run Jobs, Workflows; Kubernetes CronJobs, Argo Workflows CronWorkflow self-hosted. Discipline:

4. Durable workflows — for long-running orchestration. AWS Step Functions; Azure Durable Functions; GCP Cloud Workflows; Temporal, Cadence, Dapr Workflow, Argo Workflows self-hosted — for flows spanning seconds-to-days. State lives in the engine, not in process memory. Sagas with compensating actions handle distributed transactions.

The outbox pattern (write to DB and an outbox table in the same transaction; a separate process publishes from the outbox) is the reliable way to publish events from a transactional database — and works equally for broker, stream, and outbound webhook delivery.

Long-running agent runs are a special case of durable workflows. Loop state (steps, tool calls, decisions) is checkpointed; every run carries hard limits (max steps, max wall-clock, max tool calls, max tokens, max cost). Replay from the run record enables debugging and post-mortem.

5. Outbound webhooks — for delivering events to third-party consumers. When the consumer is outside the platform — a customer's URL, a partner system — the delivery channel is a webhook, not a broker subscription. Discipline:

Envelopes carry trace context (see 16), tenant ID (see 14), and schemaVersion (see 3) — identical envelope shape regardless of substrate. Synchronous RPC is reserved for low-latency, end-user-facing reads.
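The signed-webhook discipline can be sketched as a timestamped HMAC over the body, verified by the receiver. Header names here are hypothetical; the pattern (sign with a shared secret, reject stale timestamps, compare in constant time) is the point.

```python
import hashlib
import hmac
import json
import time

def sign_webhook(secret: bytes, payload: dict, clock=time.time) -> dict:
    """Producer side: canonical body plus a timestamped HMAC."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True)
    ts = str(int(clock()))
    sig = hmac.new(secret, f"{ts}.{body}".encode(), hashlib.sha256).hexdigest()
    return {"X-Webhook-Timestamp": ts, "X-Webhook-Signature": sig, "body": body}

def verify_webhook(secret: bytes, message: dict, max_age=300, clock=time.time) -> bool:
    """Receiver side: reject old deliveries (replay window) and any
    body that does not match the signature."""
    ts, body = message["X-Webhook-Timestamp"], message["body"]
    if int(clock()) - int(ts) > max_age:
        return False
    expected = hmac.new(secret, f"{ts}.{body}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["X-Webhook-Signature"])
```

`hmac.compare_digest` matters: a naive `==` comparison leaks timing information an attacker can exploit to forge signatures byte by byte.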

Avoid

Using streams as a generic message bus when no consumer needs replay (you pay the operational cost of offsets and consumer groups for nothing); hidden synchronous chains masquerading as async (await is not an event); cron expressions buried inside application code rather than in IaC; agent loops without hard limits — a denial-of-wallet attack waiting to happen; adopting event-sourcing-as-source-of-truth platform-wide because it sounded good in a talk; firing-and-forgetting outbound webhooks inline from a request handler.

13

Edge, Ingress, Gateway, and CDN Discipline

Every external request enters through a hardened, observable, policy-enforced edge. Static and cacheable content is served from a CDN at the same edge tier. Internal services never face the public internet directly.

In 2026

Per-route policies (rate limits, JWT validation, request size caps, WAF mode, cache TTL) live as YAML alongside the service's OpenAPI and are applied to the gateway on PR merge. Cacheable responses carry explicit Cache-Control and Vary headers; non-cacheable responses say so explicitly. Errors follow RFC 9457 problem-details with a traceId for correlation.

Avoid

A single bypass of the edge "for one internal use case" — that bypass becomes the next incident's root cause; rate limiting reimplemented in each service (push it to the edge; services enforce only business-logic concurrency); accidental caching of authenticated responses (always vary on the auth header or explicitly mark private).

14

Tenancy and Blast-Radius Isolation

Tenant boundaries are explicit at every layer. A failure or breach in one tenant does not compromise another.

In 2026

Even single-tenant systems adopt explicit tenancy from day one — retrofitting it is invasive.

Database schemas include a tenant_id column with row-security enforced at the database level — application code cannot disable it. An onboarding pipeline provisions per-tenant infrastructure (IaC parameters) and populates baseline configuration.
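Database-level row security is the enforcement layer; an application-layer guard like the following sketch (SQLite standing in for the real database, names hypothetical) complements it by making an unscoped query impossible to write by omission:

```python
import sqlite3

class TenantScopedDB:
    """Every query goes through a forced tenant filter; the tenant_id
    is stamped by the wrapper, never supplied by the caller."""
    def __init__(self, conn, tenant_id: str):
        self._conn = conn
        self._tenant = tenant_id

    def select(self, table: str, columns: str = "*"):
        sql = f"SELECT {columns} FROM {table} WHERE tenant_id = ?"
        return self._conn.execute(sql, (self._tenant,)).fetchall()

    def insert(self, table: str, row: dict):
        row = {**row, "tenant_id": self._tenant}
        cols = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        self._conn.execute(f"INSERT INTO {table} ({cols}) VALUES ({marks})",
                           tuple(row.values()))
```

In production the same effect is achieved with, e.g., PostgreSQL row-level security bound to a session variable, so even raw SQL outside the wrapper stays scoped.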

Avoid

Tenant isolation as a code-level convention without database enforcement (one missed WHERE tenant_id = ? becomes the next data-leak incident); "we'll add tenancy later" — costs grow exponentially with existing data and code volume.

15

Layered Testing, Including Non-Deterministic

Every change passes through a defined testing pyramid before reaching production. Each layer has a clear gate; non-deterministic systems are tested through evals as a first-class layer.

In 2026

Eval datasets are versioned with semver in tests/evals/datasets/; changes go through PR review. A CI eval job posts a comment with score deltas and links. Shadow traffic — candidate prompt or model enabled for opt-in requests with outputs compared offline before promotion. Adversarial / red-team eval sets sit alongside happy-path sets and run on every release candidate.
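The "codify the rubric" rule reduces to something like the sketch below — a versioned dataset of cases, a mechanical check per case, and a threshold the CI job gates on. The containment rubric is a placeholder for whatever scoring (exact match, LLM-as-judge, regex) the service actually codifies.

```python
def run_eval(dataset: list, candidate, threshold: float = 0.9) -> dict:
    """Score a candidate (prompt, model, or agent) against a versioned
    eval dataset; the returned score feeds the CI gate and the PR
    comment with deltas."""
    passed = sum(1 for case in dataset
                 if case["must_contain"] in candidate(case["input"]))
    score = passed / len(dataset)
    return {"score": score, "passed": score >= threshold}
```

The same harness runs the adversarial / red-team sets; only the dataset file differs, which is why datasets live under version control with semver.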

Avoid

Testing only the happy path (adversarial tests aren't optional); eval-as-vibes — a human eyeballing a few outputs. Codify the rubric or it doesn't exist.

16

Observability via OpenTelemetry

Logs, metrics, and traces flow through one OpenTelemetry pipeline, including formal semantic conventions for AI workloads.

In 2026

Each runtime auto-instruments HTTP, DB, broker, stream, MCP server, LLM client, real-time hub, communication channels, webhook dispatcher, and agent loop. Service code emits domain events; plumbing is automatic. A repository-wide attribute schema ensures dashboards and alerts work uniformly across services. PII redaction (see 21) runs in the SDK before export. LLM-specific UX layers — Langfuse, Phoenix, Helicone — are optional supplements, not replacements.

Avoid

Logging-as-debugging — if you can't answer the question from existing traces, fix the instrumentation, don't add a log line; per-service custom attribute names — they make cross-service dashboards impossible.

17

Resilience by Default

Every external interaction declares timeout, retry, circuit-breaker, bulkhead, and cost policy.

In 2026

Library choices follow language: Polly (.NET), tenacity + httpx-with-Hyx (Python), cockatiel or NestJS interceptors (Node), Resilience4j (JVM), failsafe-go / retry-go (Go), tower (Rust). All configured from a shared resilience policy schema. A resilience.yaml per service declares per-dependency policies; middleware loads them at startup. Resilience policies for backing services in infra/ (broker retry settings, real-time hub connection backoff, gateway retry budgets) are declared alongside the resource definition.
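To make the shared-policy idea concrete, here is a language-agnostic sketch (policy keys are hypothetical, standing in for one resilience.yaml entry): bounded retries plus a failure-count circuit breaker that fails fast during a cooldown window.

```python
import time

class CircuitOpen(Exception):
    pass

class ResilientCall:
    """Policy-driven wrapper: max attempts and a simple breaker,
    loaded from a per-dependency policy entry."""
    def __init__(self, policy: dict, clock=time.time):
        self._max_attempts = policy.get("max_attempts", 3)
        self._break_after = policy.get("break_after_failures", 5)
        self._cooldown = policy.get("cooldown_seconds", 30)
        self._failures = 0
        self._opened_at = None
        self._clock = clock

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._cooldown:
                raise CircuitOpen("failing fast; dependency marked unhealthy")
            self._opened_at, self._failures = None, 0  # half-open: probe again
        last = None
        for _ in range(self._max_attempts):
            try:
                result = fn()
                self._failures = 0
                return result
            except Exception as exc:
                last = exc
                self._failures += 1
                if self._failures >= self._break_after:
                    self._opened_at = self._clock()
                    raise CircuitOpen("breaker tripped") from exc
        raise last
```

Production libraries (Polly, Resilience4j, cockatiel) add jittered backoff, bulkheads, and per-caller budgets on top of this shape; the point is that the numbers come from the declared policy, not from constants scattered through handlers.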

Avoid

Default infinite retries (they amplify outages into incidents); the same retry policy for all callers (a user request and a background job have different patience budgets); retrying on a destructive non-idempotent endpoint.

18

Disaster Recovery and Business Continuity

Recovery time and recovery point objectives are defined per service tier, replication strategy is explicit per data class, and restores are rehearsed.

In 2026

DR is a separate discipline from per-request resilience (see 17). Resilience handles a flapping dependency; DR handles a region going dark, an accidental delete-all, or ransomware.

Data replication strategy explicit per data class:

Data class | Typical replication mode | Notes
Relational DB | Async geo-replica or active-passive geo-redundant | Promotion runbook required; lag is the RPO floor. AWS RDS / Aurora Global Database; Azure geo-replication / failover groups; GCP Cloud SQL cross-region replicas / AlloyDB
Document / NoSQL | Multi-region writes (active-active) where the data model allows | Conflict resolution policy explicit. DynamoDB Global Tables, Cosmos DB multi-region writes, Firestore multi-region, Cassandra multi-DC
Cache | Active-active geo-replication for low-latency reads, or rebuild on failover | Treat cache contents as recomputable; not the source of truth
Object storage | Cross-region replication for read access during regional outage | S3 CRR; Azure GRS / RA-GRS; GCS dual-region / multi-region; lifecycle policies replicate across regions
Message broker | Geo-DR pairing — primary-secondary alias, manual or scripted failover | In-flight messages are NOT replicated; consumers must be idempotent. SQS cross-region forwarding patterns; Service Bus geo-DR; Pub/Sub multi-region by default
Event stream | Sync geo-DR or mirror-maker — zero or near-zero data loss | Stream is the audit log; replication is non-negotiable. MSK Replicator; Event Hubs geo-DR; Pub/Sub Lite cross-region; Kafka MirrorMaker 2 / Confluent Cluster Linking
Real-time hub | Regional with client reconnection on failover | Connection state is ephemeral by design
Vector store / Search index | Rebuild from source-of-truth data store | Index in-region; reindex on regional recovery

Each service declares its tier in slos/<service>.yaml with enforceable RTO/RPO targets. Critical-tier services run in two regions with active/passive failover. Promotion runbook in runbooks/dr/<service>.md, exercised quarterly. Backup restore drills run automatically against a non-prod environment and produce a pass/fail signal for the SLO dashboard.

Avoid

Treating "the cloud" as inherently durable; untested backups — until a restore has succeeded, the backup is a hypothesis; replicating cache contents instead of regenerating them; assuming broker geo-DR replicates in-flight messages (it doesn't).

19

SLOs, Error Budgets, and Runbooks-as-Code

What "working" means is defined in code, measured continuously, and connected to operational decisions.

In 2026

The repository has top-level slos/, runbooks/, and postmortems/. CI lints alerts to ensure each has a linked runbook. SLO definitions are applied to the observability backend as code (Sloth, OpenSLO, Datadog SLO IaC, Azure Monitor SLO IaC, GCP Service Monitoring SLO IaC). The on-call schedule itself is defined as code (PagerDuty / Opsgenie / Grafana OnCall configuration in oncall.yaml).

Avoid

SLOs nobody reads — if the dashboard isn't part of the regular operating cadence, the SLO is aspirational; heroic recovery without postmortems — the next incident has the same root cause and a different victim.

20

Zero-Trust Identity and Authorization

No network is trusted by default. Every request, from any source, carries an identity, and authorization is enforced at every layer.

In 2026

Every HTTP, MCP, and A2A endpoint requires authentication — no anonymous routes outside /health/*. Shared auth/ middleware parses tokens and exposes a principal (user / service / agent + roles + tenant + clearance). Tool-level RBAC is declared in mcp.yaml and enforced before invocation. Destructive tools (any side effect that cannot be reversed by a subsequent call) require either elevated scope or a human-in-loop checkpoint. CSPM rules (AWS Config, Azure Defender for Cloud, GCP Security Command Center, or self-hosted Cloud Custodian / Steampipe) block deployment of any resource without a managed identity.
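The principal-plus-tool-RBAC flow can be sketched as follows. The tool names and policy shape are hypothetical stand-ins for what a service would declare in mcp.yaml; the invariant is that the check runs before invocation and destructive tools additionally require a human-in-loop approval.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Principal:
    subject: str
    kind: str          # "user" | "service" | "agent"
    roles: frozenset
    tenant: str

TOOL_POLICY = {
    # shape a service might declare in mcp.yaml (names hypothetical)
    "search_orders": {"roles": {"reader", "admin"}, "destructive": False},
    "refund_order":  {"roles": {"admin"},           "destructive": True},
}

def authorize_tool(principal: Principal, tool: str,
                   human_approved: bool = False) -> None:
    """Enforced BEFORE invocation: role check for every tool, plus a
    human-in-loop checkpoint for destructive ones."""
    policy = TOOL_POLICY.get(tool)
    if policy is None:
        raise PermissionError(f"unknown tool: {tool}")
    if not policy["roles"] & principal.roles:
        raise PermissionError(f"{principal.subject} lacks a role for {tool}")
    if policy["destructive"] and not human_approved:
        raise PermissionError(f"{tool} is destructive; human approval required")
```

Tenant scoping (see 14) layers on top: the principal's tenant restricts which rows a permitted tool may touch, independent of the role check.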

Avoid

"Internal" endpoints that skip auth because "they're behind the firewall" — zero-trust means no inside; long-lived API keys — they become the next credential leak.

21

Privacy, Data Classification, and Audit

Every data field has a classification. Collection is minimized, location is known, retention is bounded, sensitive data is redacted before crossing into prompts and observability, and every state-changing action is auditable.

In 2026

A repository-wide data-classification.yaml lists every field-name pattern and its classification. CI fails if a new field lacks one. A shared privacy/ middleware redacts classified fields from logs and traces. Vector stores tag every embedding with source classification; retrieval filters honor the caller's clearance. The audit log is a separate event stream piped to immutable object storage; no service writes to audit storage directly.
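A minimal sketch of the redaction middleware, assuming a classification map of field-name patterns (the patterns below are illustrative, standing in for data-classification.yaml entries):

```python
import re

CLASSIFICATION = {
    # field-name pattern -> classification (patterns hypothetical)
    r"(^|_)email($|_)": "pii",
    r"(^|_)ssn($|_)": "pii",
    r"card_number": "pci",
}
REDACTED_CLASSES = {"pii", "pci"}

def redact(record: dict) -> dict:
    """Runs in the privacy/telemetry middleware before any log, trace,
    or prompt leaves the process; nested structures are walked too."""
    out = {}
    for key, value in record.items():
        if isinstance(value, dict):
            out[key] = redact(value)
        elif any(re.search(pattern, key)
                 for pattern, cls in CLASSIFICATION.items()
                 if cls in REDACTED_CLASSES):
            out[key] = "[REDACTED]"
        else:
            out[key] = value
    return out
```

Because the map is repository-wide, the same patterns drive the CI check that fails when a new field lacks a classification.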

Avoid

Free-text fields that quietly become PII landfills (classify by the highest-classification content they might hold); audit logs in the same store as application data — co-located audit logs are tampered audit logs in the wrong incident.

22

FinOps as a First-Class Property

Compute, storage, network, and AI token costs are attributable per service, per tenant, per request — and visible to engineers in their normal workflow.

In 2026

A nightly job rolls cost up by service, endpoint, and tenant; results appear in a service catalog tab and observability workbooks. The shared LLM client blocks calls that would exceed the per-request budget unless explicitly elevated. Quarterly cost reviews are part of the engineering cadence — owned by service teams, not finance.
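The budget-enforcing LLM client can be sketched as a thin wrapper; the price model and the inner client are hypothetical, but the shape — estimate before the call, refuse over budget unless elevated, record attributed spend — is the discipline the factor asks for.

```python
class BudgetExceeded(Exception):
    pass

class BudgetedLLMClient:
    """Wraps the real provider client; every call is pre-costed and
    attributed per service and tenant for the nightly rollup."""
    def __init__(self, inner, price_per_1k_tokens: float, budget_usd: float):
        self._inner = inner
        self._price = price_per_1k_tokens
        self._budget = budget_usd
        self.ledger = []   # (service, tenant, cost) rows for attribution

    def complete(self, prompt: str, max_tokens: int, *, service: str,
                 tenant: str, elevated: bool = False) -> str:
        # crude token estimate: word count in + max tokens out
        estimate = (len(prompt.split()) + max_tokens) / 1000 * self._price
        if estimate > self._budget and not elevated:
            raise BudgetExceeded(
                f"estimated ${estimate:.4f} exceeds ${self._budget:.4f}")
        result = self._inner(prompt, max_tokens)
        self.ledger.append((service, tenant, estimate))
        return result
```

A production version would use the provider's tokenizer and actual usage from the response, but the control point is the same: the shared client, not each caller, enforces the budget.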

Avoid

Aggregate-only dashboards (the unit-economics question requires per-request granularity); treating cost as somebody else's problem (the team that ships the code owns its operating cost).

23

Documentation, Decisions, and Machine-Readable Seams

The repository is structured for two readers — humans and software agents — and they're now the same audience. Decisions are captured where future readers will look.

In 2026

AI agents (Claude Code, Copilot, Cursor, Aider, Cline) are permanent collaborators. Repository structure, naming, and documentation are architectural choices that determine how effectively those collaborators — and humans — can work.

A weekly job summarizes recent ADRs and posts to the team channel — decisions don't get lost in a folder.
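A weekly digest job of this shape could be sketched as below, assuming a conventional docs/adr/ folder of MADR-style NNNN-title.md files; the folder layout and message format are illustrative, and posting to the team channel (e.g. via a chat webhook) is left out.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def recent_adrs(adr_dir: Path, since_days: int = 7) -> list[str]:
    """Return names of ADR files (NNNN-title.md) modified in the last since_days days."""
    cutoff = time.time() - since_days * SECONDS_PER_DAY
    return sorted(p.name for p in adr_dir.glob("[0-9]*.md") if p.stat().st_mtime >= cutoff)


def digest(names: list[str]) -> str:
    """Format the weekly message for the team channel."""
    if not names:
        return "No new ADRs this week."
    return "ADRs updated this week:\n" + "\n".join(f"- {n}" for n in names)
```

Keying the digest off file modification time keeps the job stateless; a fancier version could summarize each ADR's "Decision" section instead of just listing names.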

Avoid

Documentation that becomes a parallel universe — beautiful, ignored, wrong (tie docs to executable artifacts so drift is detectable in CI); ADRs as compliance theater — written after the fact to justify a decision already made.

Appendix A — Glossary

A2A: Agent-to-Agent — protocol for inter-agent communication including capability discovery and trust handshake.
ADR: Architecture Decision Record (MADR format).
APM: Application Performance Monitoring.
AsyncAPI: OpenAPI-equivalent specification for event-driven and message-based APIs.
CDC: Change Data Capture — streaming database changes as events.
CDN: Content Delivery Network.
CSPM: Cloud Security Posture Management.
Dapr: Distributed Application Runtime — sidecar primitives for service mesh, state, pubsub, secrets.
DPDP: India's Digital Personal Data Protection Act, 2023.
DR: Disaster Recovery.
FinOps: Discipline of managing variable cloud and AI spend as an engineering concern.
GenAI Conventions: OpenTelemetry semantic conventions for generative-AI telemetry.
IaC: Infrastructure as Code.
KEDA: Kubernetes Event-Driven Autoscaling.
LLM-as-judge: Pattern where an LLM scores another LLM's output against a rubric.
MADR: Markdown Any Decision Records — common ADR format.
MCP: Model Context Protocol — standard for exposing tools, prompts, and resources to LLM agents.
mTLS: Mutual TLS — both sides authenticate by certificate.
OIDC: OpenID Connect.
OPA: Open Policy Agent (with the Rego policy language).
OTel / OTLP: OpenTelemetry / OpenTelemetry Protocol.
PII: Personally Identifiable Information.
RAG: Retrieval-Augmented Generation.
RBAC: Role-Based Access Control.
RTO / RPO: Recovery Time Objective / Recovery Point Objective.
SBOM: Software Bill of Materials.
Sigstore / Notation: Container image and artifact signing systems.
SLI / SLO: Service Level Indicator / Service Level Objective.
WAF: Web Application Firewall.

Appendix B — Heritage

This document derives from two predecessors.

12-Factor App (Wiggins, 2012) → 23 Factors

Original factor → 23 Factors
I. Codebase → 1
II. Dependencies → 5
III. Config → 4
IV. Backing services → 11
V. Build, release, run → 7
VI. Processes → 9
VII. Port binding → 10
VIII. Concurrency → 9
IX. Disposability → 9
X. Dev/prod parity → 6
XI. Logs → 16
XII. Admin processes → 7, 12, 19 (distributed)

Beyond the Twelve-Factor App (Hoffman, 2016) → 23 Factors

Hoffman factor → 23 Factors
1. One Codebase → 1
2. API First → 2
3. Dependency Management → 5
4. Design, Build, Release, Run → 7 (with design distributed across 2, 5, 23)
5. Configuration, Credentials, Code → 4
6. Logs → 16
7. Disposability → 9
8. Backing Services → 11
9. Environment Parity → 6
10. Administrative Processes → 7, 12, 19
11. Port Binding → 10
12. Stateless Processes → 9
13. Concurrency → 9
14. Telemetry → 16
15. Authentication and Authorization → 20

What 23 Factors adds

Beyond the inherited factors, this manifesto introduces explicit disciplines for:

contract-first multi-audience design including agents (2)
versioned backwards-compatible evolution across HTTP, MCP, events, streams, and database (3)
provenance and supply-chain integrity (5)
progressive feature-flagged delivery with preview environments (8)
self-bound ports for HTTP / MCP / A2A audiences (10)
expanded backing-services taxonomy covering real-time hubs, push, communications, CDN, LLM, and external SaaS (11)
broker / stream / scheduled-job / durable-workflow / outbound-webhook discipline (12)
edge / ingress / gateway / CDN as one tier (13)
tenancy and blast-radius isolation (14)
layered testing with evals as a first-class layer (15)
OpenTelemetry for AI workloads (16)
resilience by default with cost circuit breakers (17)
DR with explicit replication strategy per data class (18)
SLOs and runbooks-as-code (19)
zero-trust identity for humans, services, and agents (20)
privacy / classification / audit (21)
FinOps as engineering discipline (22)
machine-readable seams for human and agent collaborators (23)