Evolving Scalability and Maintainability from 10K to 1M+ Daily Active Users

May 28, 2025

When building a platform that must grow from a few thousand daily active users (DAU) to hundreds of thousands or even millions, the guiding principle is always the same: do today what makes tomorrow easier. In practice, that means structuring code and infrastructure so that each layer can be replaced, scaled or forked without rewriting the entire system. Below is a detailed account of how I would approach scalability and maintainability at three distinct stages of growth:

  1. Early Stage (DAU < 10 000)
  2. Intermediate Stage (DAU 100 000–1 000 000)
  3. Large-Scale, Multi-Region Stage (DAU > 1 000 000)

Throughout, I focus on concrete patterns, trade-offs and examples rather than marketing language or buzzwords.


1. Early Stage (DAU < 10 000): “Get the foundation right”

At this stage, traffic levels are modest. The goal is not to over-engineer but to ensure that the codebase and deployment pipelines follow solid engineering fundamentals. By being deliberate now, you avoid a tangled web of technical debt later.

1.1 Single Responsibility & Modular Design

  • Separate by functional domain.

    • On the backend, split each major feature (for instance, “agent orchestration,” “knowledge-graph queries,” “user API,” “authentication/token service”) into its own folder, module or package.

    • Within each module, create distinct layers:

      1. Models (SQLAlchemy classes or ORM equivalents)
      2. Services (pure business logic, no HTTP routing or framework code)
      3. Resources / Controllers (Flask Blueprints or equivalent, translating HTTP requests into service calls)
      4. Utilities (encryption, logging wrappers, configuration loaders)

    By enforcing that each file or class does exactly one thing—no mixed concerns—you can trace a bug, add a feature or refactor one piece without worrying that you’ll break something halfway across the repo. A minimal sketch of this layering appears at the end of this subsection.

  • Frontend: component boundaries

    • In a React + Next.js codebase, organize folders by domain (e.g., /components/AgentList, /components/TaskEditor, /components/SettingsModal).
    • Each component should be “pure” when possible—receive data via props, emit events via callbacks, and avoid reaching into global state unless absolutely necessary.
    • When state is shared (e.g., which agent is selected), confine it to a top-level “store” (Redux slice or Context hook), but avoid sprawling, application-wide stores; only keep global what truly must be global.
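
Returning to the backend layering above, here is a minimal sketch of how a service function and its HTTP resource might be kept separate. The module layout and names (agents/service.py, agents/resources.py, create_agent) are illustrative, and the Agent model and db session are assumed to come from the module’s models layer.

    # agents/service.py: pure business logic, no HTTP or framework code
    def create_agent(db_session, owner_id: int, name: str) -> dict:
        """Create an agent row and return a plain dict for the resource layer to serialize."""
        agent = Agent(owner_id=owner_id, name=name)   # Agent is the SQLAlchemy model (assumed)
        db_session.add(agent)
        db_session.commit()
        return {"id": agent.id, "name": agent.name}

    # agents/resources.py: thin HTTP layer translating requests into service calls
    from flask import Blueprint, request, jsonify

    agents_bp = Blueprint("agents", __name__, url_prefix="/api/agent")

    @agents_bp.route("", methods=["POST"])
    def create_agent_endpoint():
        payload = request.get_json(force=True)
        result = create_agent(db.session, owner_id=payload["owner_id"], name=payload["name"])
        return jsonify(result), 201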

1.2 API & Security Best Practices

  • RESTful conventions

    • Use consistent HTTP methods (GET for retrieval, POST for creation, PUT/PATCH for updates, DELETE for removals).
    • Define resource paths clearly—e.g. /api/agent/{agent_id}/design rather than embedding logic into query parameters.
    • Validate user input early using a request parser (Flask-RESTful’s reqparse or Pydantic models) so malformed data never reaches deeper layers.
  • Authentication & Authorization

    • Store JWTs (or OAuth tokens) in secure HTTP-only cookies or, if you must use local storage, accompany them with a short-lived refresh token protocol.
    • Every protected route in Flask should be decorated with something like @token_required that checks expiration, roles and scopes.
    • On the frontend, guard private pages via a route middleware that verifies login state before rendering.
  • Secure defaults

    • Use HTTPS at every layer—load balancer, ingress controller, local development.
    • Enable CORS only on origins you control.
    • Use parameterized queries for all SQL (never build queries via string interpolation), and sanitize user-supplied strings before using them in file paths or writes.
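
To make this concrete, here is a minimal sketch of the @token_required decorator mentioned above, assuming PyJWT, an HS256-signed token delivered in an HTTP-only cookie, and a JWT_SECRET config value; the cookie name and claim layout are illustrative.

    # auth/decorators.py: minimal token check for Flask routes (assumes PyJWT)
    import functools

    import jwt
    from flask import current_app, g, jsonify, request

    def token_required(view):
        @functools.wraps(view)
        def wrapper(*args, **kwargs):
            token = request.cookies.get("access_token")      # HTTP-only cookie set at login
            if not token:
                return jsonify({"error": "missing token"}), 401
            try:
                claims = jwt.decode(token, current_app.config["JWT_SECRET"], algorithms=["HS256"])
            except jwt.ExpiredSignatureError:
                return jsonify({"error": "token expired"}), 401
            except jwt.InvalidTokenError:
                return jsonify({"error": "invalid token"}), 401
            g.user_id = claims["sub"]                        # caller identity for downstream code
            g.scopes = claims.get("scopes", [])              # roles/scopes for authorization checks
            return view(*args, **kwargs)
        return wrapper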

1.3 CI/CD, QA & Observability

  • Continuous Integration

    • Set up a pipeline (GitHub Actions, GitLab CI, CircleCI) that runs:

      1. Linting (e.g., flake8 or pylint for Python, eslint for TypeScript).
      2. Unit tests (pytest on the backend, Jest/React Testing Library on the frontend).
      3. Type checks (mypy or Pyright for Python, TypeScript compiler for the frontend).
    • Fail the build if any step errors. Early feedback saves hours of debugging later.

  • Continuous Deployment

    • With DAU under 10 000, it’s acceptable to redeploy a monolithic Flask+React Docker image once tests pass. Use a staging environment tied to your main branch, and require manual approval before promoting to production.
  • Observability

    • Integrate a logging framework that writes structured logs (JSON) with at least: timestamp, service name, request ID, user ID, endpoint, and error stack trace.

    • Deploy a single Prometheus instance scraping:

      • Flask (use Prometheus Python client to track request latency, error count).
      • Next.js (track page rendering times, bundle sizes).
      • Container-level metrics (CPU, memory).
    • Wire Grafana to show dashboards like “HTTP 500s per minute,” “average response time,” “Redis hit/miss rate.”

  • Local development & testing

    • Use Docker Compose to spin up Postgres, Redis, Flask, and Next.js in one command. Include a “seed” script to prepopulate sample users, agents and tasks so front-end changes can be tested against realistic data.
    • Write a handful of integration tests that run against a throwaway Postgres container—ensuring your database migrations and schema are valid.
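
As a sketch of the Flask-side Prometheus instrumentation described above, using the prometheus_client package; the metric names and label set are illustrative.

    # metrics.py: request latency and error counting for a Flask app (prometheus_client)
    import time

    from flask import Flask, Response, g, request
    from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

    REQUEST_LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["endpoint"])
    REQUEST_ERRORS = Counter("http_request_errors_total", "HTTP 5xx responses", ["endpoint"])

    app = Flask(__name__)

    @app.before_request
    def start_timer():
        g.start = time.perf_counter()

    @app.after_request
    def record_metrics(response):
        elapsed = time.perf_counter() - getattr(g, "start", time.perf_counter())
        endpoint = request.endpoint or "unknown"
        REQUEST_LATENCY.labels(endpoint=endpoint).observe(elapsed)
        if response.status_code >= 500:
            REQUEST_ERRORS.labels(endpoint=endpoint).inc()
        return response

    @app.route("/metrics")
    def metrics():
        # Prometheus scrapes this endpoint
        return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)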

1.4 Distributed Framework Awareness

  • Preparation, not premature optimization

    • Even though you aren’t running a Kafka cluster today, structure your code so that deferring a heavy computation (e.g., RL training, large embedding generation) to a “worker pool” is a one-line change.

    • In Python, this might look like:

      # Current: synchronous call
      def train_model(task_id):
          task_data = load_task(task_id)   # load_task is a placeholder for fetching the task's data
          result = rl_engine.update(task_data)
          save(result)

      # Prepared: the same entry point can run inline or publish to a queue
      class Trainer:
          @staticmethod
          def train(task_id, sync=True):
              task_data = load_task(task_id)
              if sync:
                  return rl_engine.update(task_data)
              else:
                  # publish to RabbitMQ or Kafka; a worker picks it up later
                  publish("train_queue", task_id)
    • That way, when message queues become necessary at 100 000 users, you only flip a flag rather than rewriting your service.

Summary for DAU < 10 000

  • Focus on clean code: single responsibility, modular layers, no tangled dependencies.
  • Follow RESTful and security practices—input validation, token checks, HTTPS everywhere.
  • Build out basic CI/CD pipelines and observability from day one, even if the volume is low.
  • Design with deferred distribution in mind: you won’t set up Kafka today, but you’ll write your code so that plugging it in later is trivial.

2. Intermediate Stage (DAU 100 000–1 000 000): “Divide, Conquer, Cache”

When you cross into hundreds of thousands of daily users, a single codebase running on one or two servers will start to buckle. Both performance (throughput, latency) and organizational considerations (multiple teams, multiple responsibilities) come into play. At this point, we layer in service-level partitioning, distributed data stores, caching, and asynchronous processing.

2.1 Service Partitioning

  • Why split services?

    • At scale, every extra millisecond of latency and every extra CPU cycle across 100 000 requests per minute adds up. If “agent orchestration,” “knowledge-graph queries,” “RAG retrieval” and “user profile management” all live in one Flask app, a sudden spike in RAG traffic can slow down unrelated endpoints.

    • By splitting each major capability into its own microservice (or namespace in Kubernetes), you gain:

      1. Independent scaling (spin up more replicas of the RAG service without touching the user API).
      2. Isolated failures (a memory leak in the knowledge-graph service doesn’t take down authentication).
      3. Clearer ownership (teams can own a bounded context end-to-end).
  • Example services

    1. Agent Orchestration Service

      • Exposes /run-agent and /get-agent-status.
      • Contains logic to parse a task DAG, fetch each task’s config, and route requests to LLMs or downstream DBs.
      • Packaged into its own Docker image: ohwise/agent-orchestrator:v2.
    2. RAG Retrieval Service

      • Exposes /search-knowledge which accepts a query, looks up embeddings in a local vector store, then calls a ranking or chunk-retrieval routine.
      • Runs in a separate container that mounts its own vector index (e.g., FAISS or Weaviate).
    3. Knowledge-Graph Service

      • Exposes CRUD endpoints for nodes and edges, plus a /traverse endpoint that returns subgraphs given a starting node.
      • Runs on top of a Neo4j Causal Cluster (at least three Neo4j core instances).
    4. User API Service

      • Manages user profiles, permissions, OAuth tokens.
      • Has its own Postgres database, with only user-specific tables: users, user_integrations, account_users, etc.
  • Inter-service communication

    • Synchronous HTTP when low-latency calls are needed (e.g., check user permissions before running a task). Use a discovery mechanism or environment variables to locate each endpoint.
    • Asynchronous messaging for heavier jobs (e.g., “re‐embed all knowledge on new vector update”). Publish tasks to Kafka/RabbitMQ under topics like embedding_updates or rl_training_jobs.
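
As a small illustration of the synchronous path, a service can locate a peer via an environment variable and fail fast on timeouts. The variable name, URL and endpoint below are placeholders, not an existing API.

    # clients/user_api.py: synchronous permission check with env-based discovery
    import os

    import requests

    USER_API_URL = os.environ.get("USER_API_URL", "http://user-api:8000")  # set per environment

    def check_permission(user_id: int, action: str) -> bool:
        """Ask the User API whether a user may perform an action; fail closed on errors."""
        try:
            resp = requests.get(
                f"{USER_API_URL}/api/user/{user_id}/permissions",
                params={"action": action},
                timeout=2,        # keep tail latency bounded instead of waiting indefinitely
            )
            resp.raise_for_status()
            return resp.json().get("allowed", False)
        except requests.RequestException:
            return False          # deny by default if the permission service is unreachable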

2.2 Distributed Data Stores

  • Why cluster the databases?

    • A single Postgres at 100 000 users can be a single point of contention when you have thousands of writes per minute—especially if you store large JSONB blobs for agent graphs.
    • Neo4j single node will chug on graph traversals once the number of nodes and edges grows into the tens of millions.
  • Example: Knowledge Graph in a Neo4j Causal Cluster

    1. Neo4j Causal Cluster setup

      • At least three core members: one leader plus followers.
      • The leader handles all writes; followers replicate via the Raft protocol and can serve reads.
      • If the leader fails, one of the followers is elected as the new leader.
    2. Horizontal scale

      • Neo4j does not shard a single graph automatically, so partition at the application level by “agent_id modulo N”: each shard (its own cluster) holds the subgraphs for a set of agents (see the routing sketch at the end of this subsection).
      • Queries for a single agent’s graph then touch only one shard, reducing cross-cluster traffic.
  • Example: Vector Database

    1. Weaviate or FAISS cluster

      • Instead of storing all embeddings on a single node, run a multi-replica Weaviate cluster behind a load balancer. Each node holds a copy of frequently accessed vectors; new vectors are sharded by hash of document ID.
      • The RAG Retrieval Service routes each “search” request to the appropriate Weaviate node based on the shard key.
  • User Data

    • Instead of a single Postgres instance, use a managed Postgres cluster with read replicas. Write traffic goes to the primary; analytics and read-heavy endpoints (e.g., “list all agents for user”) go to a read replica.
    • Periodically archive old logs, old action queues and old RL training metadata into a cold-storage data warehouse (BigQuery, Redshift) so that the primary DB stays lean.
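
The shard routing described above can be a simple deterministic lookup. The connection strings and shard count below are placeholders, and GraphDatabase.driver refers to the official Neo4j Python driver.

    # sharding.py: route each agent's graph operations to the right shard
    GRAPH_SHARDS = [
        "bolt://graph-shard-0:7687",
        "bolt://graph-shard-1:7687",
        "bolt://graph-shard-2:7687",
    ]

    def shard_for_agent(agent_id: int) -> str:
        """Map an agent to a shard deterministically: agent_id modulo N."""
        return GRAPH_SHARDS[agent_id % len(GRAPH_SHARDS)]

    # Example usage with the Neo4j driver: every query for agent 42 lands on the same shard
    # from neo4j import GraphDatabase
    # driver = GraphDatabase.driver(shard_for_agent(42), auth=(user, password))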

2.3 Distributed Caching

  • Why use Redis as a cluster?

    • When hundreds of thousands of requests hit your platform per hour, repeated operations—like “look up user’s preferred model parameters” or “cache the last‐used embedding results for this user’s recent queries”—should not hammer the database each time.

    • Running Redis with Sentinel or in Cluster mode gives you:

      1. High availability (automatic failover if a Redis node dies)
      2. Horizontal partitioning (in Cluster mode, hash slots spread keys across nodes)
  • Caching patterns

    1. User Sessions / Tokens

      • Store session_token → user_id mappings in Redis with a TTL (e.g., 24 hours). This makes every authenticated request a single O(1) lookup rather than a SQL join on the sessions table.
    2. Frequently accessed embeddings

      • If a user queries the same knowledge base “How to scale in Flask?” multiple times within a day, keep the result set in Redis for 5–10 minutes so you can return the cached snippet rather than a fresh semantic search.
    3. Agent state

      • When an agent is mid‐conversation with a user, store its current DAG position, last intermediate reasoning result and partial chat history in Redis under a TTL key. If the frontend reconnects, you can resume from memory rather than reloading the entire conversation from Postgres.
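
A minimal sketch of the session-token and query-result caching patterns above, assuming the redis-py client; the key names and TTLs are illustrative.

    # cache.py: Redis-backed lookups with TTLs (redis-py)
    import json

    import redis

    r = redis.Redis(host="redis", port=6379, decode_responses=True)

    def cache_session(session_token: str, user_id: int, ttl_seconds: int = 86400) -> None:
        r.setex(f"session:{session_token}", ttl_seconds, str(user_id))   # ~24 h TTL

    def user_for_session(session_token: str):
        return r.get(f"session:{session_token}")       # O(1) lookup; None once expired

    def cached_search(query_key: str, run_search, ttl_seconds: int = 600):
        """Return a cached RAG result if present; otherwise compute and cache it for ~10 minutes."""
        cached = r.get(f"rag:{query_key}")
        if cached is not None:
            return json.loads(cached)
        result = run_search()
        r.setex(f"rag:{query_key}", ttl_seconds, json.dumps(result))
        return result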

2.4 Asynchronous Processing & Message Queues

  • Why offload work?

    • Some operations—like updating RL models or re-indexing the entire knowledge base—take minutes or hours. They cannot block user-facing requests.
    • By pushing these tasks into a broker (Kafka or RabbitMQ), you decouple the user API from the heavy lifting.
  • Example: Reinforcement Learning Updates

    1. User submits feedback (“Agent gave a wrong answer”): the UI calls /api/agent/{id}/feedback.
    2. The feedback service publishes a message to Kafka under topic "rl_feedback" with payload {agent_id, user_feedback, timestamp}.
    3. Worker pool (e.g., a Kubernetes Job or separate Python process) subscribes to "rl_feedback", batches these feedback messages every 5 minutes, runs a gradient update on the agent’s policy network, pushes the updated weights back into a shared model store (S3 or a model registry).
    4. Once the new model is published, other services (the inference service) will pull the updated weights on the next warmup cycle.
  • Example: Large File Exports

    • If a user wants to export “all agent logs for the past year” as a CSV, the request enqueues a job in "csv_export_queue". A worker reads from that queue, streams the data from Postgres into S3 in chunks, and when it’s done, writes a record to a “jobs” table. The user can poll /api/job/{id} to see the job status and download link.
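
A sketch of the feedback worker loop, assuming the kafka-python client. The topic and consumer group mirror the steps above, while apply_policy_update and publish_weights are placeholders for the RL update and model-registry code.

    # rl_feedback_worker.py: batch feedback from Kafka and trigger a policy update
    import json
    import time

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "rl_feedback",
        bootstrap_servers=["kafka:9092"],
        group_id="rl-trainers",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    BATCH_WINDOW_SECONDS = 300       # accumulate feedback for roughly five minutes
    batch, window_start = [], time.time()

    for message in consumer:         # blocks until new feedback arrives
        batch.append(message.value)  # {"agent_id": ..., "user_feedback": ..., "timestamp": ...}
        if batch and time.time() - window_start >= BATCH_WINDOW_SECONDS:
            weights = apply_policy_update(batch)   # placeholder: gradient update on the policy net
            publish_weights(weights)               # placeholder: push weights to S3 / model registry
            batch, window_start = [], time.time()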

2.5 Inside Each Service: Fine-Tuning Performance

Even within a single microservice, you cannot treat the code as “just Python.” At 100 000+ users, you need to optimize memory usage, CPU cycles and I/O patterns.

  1. Profile & Identify Bottlenecks

    • Use tools like Py-Spy or cProfile to find “hot” functions. If an endpoint spends 80 % of its time parsing JSON blobs, consider switching from the built-in json library to ujson or orjson.
    • For CPU-bound tasks (e.g., text embedding), can you delegate to a C-extension or even a GPU-enabled service?
  2. In-Memory Data Management

    • If you frequently manipulate large dictionaries or JSONB objects in Python, keep them in native Python dicts only as long as needed. Once you finish, delete references and let the garbage collector run—avoid keeping big structures around longer than necessary.
    • Use __slots__ in classes that you instantiate thousands of times (for nodes or edges) so that Python doesn’t create a __dict__ for each instance, reducing memory overhead.
  3. Efficient Garbage Collection

    • In long-running Flask workers, periodically call gc.collect() if you notice memory bloat from cyclic references.
    • Alternatively, run short-lived worker processes (e.g., a Celery worker with a --max-tasks-per-child=1000 setting) so that any hidden memory leaks get killed every 1000 tasks and the process restarts with a clean slate.
  4. Reusability & Shared Libraries

    • Extract common routines—like “marshal an agent’s graph from JSONB to a Python networkx object”—into a shared internal library. This avoids copy-paste and ensures bug fixes propagate everywhere.
    • Host that library in a private PyPI or a local Git repository so each microservice can pin a version (e.g., ohwise-utils==2.1.0) rather than copy the code.
  5. Connection Pooling & Resource Limits

    • In SQLAlchemy, size the connection pool with your replica count in mind. If you run 10 replicas of the Knowledge-Graph Service, each pod should open no more than, say, 15 Postgres connections, so the total stays safely under the database’s max_connections (a minimal pool configuration is sketched at the end of this subsection).
    • Set CPU and memory requests/limits in Kubernetes so the scheduler knows how to place pods and can throttle any runaway container.
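
Two of the points above in code form: __slots__ on a frequently instantiated class and an explicit SQLAlchemy connection pool. The class, connection string and limits are illustrative.

    # __slots__ avoids a per-instance __dict__ when thousands of node objects live in memory
    class GraphNode:
        __slots__ = ("node_id", "label", "properties")

        def __init__(self, node_id, label, properties):
            self.node_id = node_id
            self.label = label
            self.properties = properties

    # Bounded connection pool: with 10 replicas at 15 connections each, the total stays
    # under a typical Postgres max_connections.
    from sqlalchemy import create_engine

    engine = create_engine(
        "postgresql://app:secret@postgres:5432/ohwise",
        pool_size=10,        # steady-state connections per pod
        max_overflow=5,      # bursts may open up to 15 in total
        pool_pre_ping=True,  # drop dead connections before handing them out
    )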

3. Large-Scale, Multi-Region Stage (DAU > 1 000 000): “Global Deployment & Sharding”

Once daily active users exceed one million—and especially when those users span multiple continents—latency, fault tolerance, data sovereignty and regulatory compliance become paramount. At this point, you treat each component as an independent, geo-distributed system.

3.1 Multi-Region Kubernetes Clusters & Geo-Load Balancing

  • Why multi-region?

    • A user in Tokyo expects sub-200 ms response times. If all your servers live in US-East, a single request hops around the globe. Deploy Kubernetes clusters in:

      1. us-east-1 (N. Virginia)
      2. us-west-2 (Oregon)
      3. eu-central-1 (Frankfurt)
      4. ap-northeast-1 (Tokyo)
  • DNS-Based Routing

    • Use Route 53 (AWS) or Cloudflare’s load balancer with Geo steering so that each user’s DNS resolves to the nearest EKS/GKE cluster.
    • In each region, run identical sets of services—Agent Orchestrator, RAG Service, Knowledge DB API, etc.
  • Cross-Region Data Replication

    • Postgres

      • Use a managed solution such as Amazon Aurora Global Database (writes go to one primary region and are replicated to secondaries with typically sub-second lag; if the primary goes down, you can promote a secondary with minimal data loss) or CockroachDB (multi-active, so every region can accept writes).
    • Neo4j

      • Use Neo4j Fabric, which allows you to federate multiple clusters. Each region’s cluster holds a primary copy of its local agents’ subgraphs, but cross-region reads can pull from remote clusters.
    • Redis

      • A global Redis Enterprise deployment can replicate data asynchronously across regions. Be careful with user session data: either confine it to its home region or make sure session keys are replicated consistently.

3.2 Advanced Data Partitioning & Sharding

  • Vector Store Sharding

    • For a global user base, embeddings can outgrow memory on a single node. Partition the vector index by:

      1. Embedding Namespace (e.g., “tech_docs,” “marketing_docs,” each in its own shard)
      2. User Cohort (e.g., “American English,” “Japanese,” “German” indices), so that language-specific searches hit smaller, more relevant partitions.
    • The RAG Retrieval Service reads a “routing table” (simple KV in Redis) that says: “If query contains language tag = ‘ja’, route to shard 3.” The shards run behind a regional load balancer.

  • High-Write OLTP Sharding

    • Tables like agent_feedback_logs or training_metadata see thousands of inserts per second once you have agents interacting constantly. A single Postgres instance simply can’t keep up. Two options:

      1. Vitess (sharded MySQL on top of Kubernetes). You define a sharding key (e.g., agent_id % N). Each insert goes to the correct shard automatically. Vitess handles rebalancing if one MySQL node becomes overloaded.
      2. CockroachDB (Postgres-compatible, multi-active-master). Data is automatically sharded and rebalanced. With COPY FROM STDIN and UPSERT, you can load logs quickly.
  • Read Replicas for Analytics

    • Rather than let long-running analytics queries block OLTP operations, set up a streaming replication to a separate “analytics” cluster. This cluster can run Spark or Druid on historical logs without impacting the primary.
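
The routing-table lookup from the vector-store bullet above can be sketched as follows, assuming redis-py; the key layout and shard endpoints are placeholders.

    # rag_routing.py: pick a vector-store shard from a routing table kept in Redis
    import redis

    r = redis.Redis(host="redis", port=6379, decode_responses=True)

    DEFAULT_SHARD = "http://vector-shard-0:8080"

    def shard_for_query(language_tag: str, namespace: str) -> str:
        """Look up the shard endpoint for this namespace and language; fall back to a default."""
        endpoint = r.get(f"shard_route:{namespace}:{language_tag}")
        return endpoint or DEFAULT_SHARD

    # Example: a Japanese query against the tech_docs namespace
    # search_endpoint = shard_for_query("ja", "tech_docs")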

3.3 Service Mesh & Inter-Service Security

  • Why a Service Mesh?

    • When you have dozens of microservices in each region—and each region might need to communicate with the others—a service mesh like Istio or Linkerd provides:

      1. Automatic mTLS between pods.
      2. Fine-grained traffic policies (e.g., route 10 % of requests for /api/agent in us-east to a canary version).
      3. Observability (distributed tracing, per-service metrics, request-level logs).
  • mTLS & Zero-Trust

    • By default, all pods have a sidecar proxy (Envoy or similar). Each request carries a TLS certificate issued by the mesh control plane. Even if a container is compromised, it cannot pretend to be another service—because it lacks the private key.
  • Regional Gateways

    • In each region, expose an Ingress Gateway that terminates TLS from clients. Inside the mesh, traffic is encrypted via mTLS. Ingress gateways can also perform geo-based routing or canary deployments.
  • Example: Rolling out a breaking change

    • Suppose you change the /api/agent/{id}/feedback contract. With a service mesh, you can deploy version 2 of the Feedback service behind a “v2” label. The mesh can route 1 % of traffic to v2 (smoke testing), then gradually shift to 100 %. If errors spike, you can immediately roll back to v1 without redeploying all services.

3.4 Cross-Region Caching & Rate Limiting

  • Global Redis Federation

    • User session tokens and embedding caches should be local to each region for lowest latency. However, certain shared caches—like global feature flags or common vector shards—can be replicated asynchronously. Each Redis instance in Region A subscribes to a replication channel from Region B, ensuring updates propagate within seconds.
  • Rate Limiting / Throttling

    • If one region suddenly has a traffic spike (e.g., a flash sale or a viral agent prompt), you must throttle calls at the edge. Use the ingress controller’s rate-limit module to enforce, say, 500 requests per second per IP or per JWT. This prevents one region from consuming all upstream resources (like the RAG cluster).
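
Rate limiting is best enforced at the ingress layer, but the same fixed-window idea can be sketched in application code with Redis INCR and EXPIRE; the limit and key scheme below are illustrative.

    # rate_limit.py: fixed-window rate limiter backed by Redis
    import redis

    r = redis.Redis(host="redis", port=6379)

    def allow_request(client_id: str, limit: int = 500, window_seconds: int = 1) -> bool:
        """Allow up to `limit` requests per client per window; reject the rest."""
        key = f"ratelimit:{client_id}"
        count = r.incr(key)                 # atomically count this request
        if count == 1:
            r.expire(key, window_seconds)   # start the window on the first hit
        return count <= limit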

3.5 Compliance, Backup & Disaster Recovery

  • Regional Isolation & Data Residency

    • If you have EU citizens using the platform, ensure their PII (user profile, payment history) resides in EU data centers only. At the schema level, tag every row with a region field so that the ORM layer can enforce routing: if region = EU, route to the eu-west-1 Postgres (a routing sketch appears at the end of this subsection).
  • Backups & Point-In-Time Recovery

    • Each region’s primary databases (Postgres, CockroachDB) should take hourly snapshots and replicate them cross-region into cold storage (S3 or Glacier). For Neo4j, schedule nightly incremental backups and store them in an immutable bucket.
    • Test your restore procedure quarterly: spin up a new cluster from backups, run integration tests, confirm service readiness.
  • Chaos Engineering

    • At this scale, plan for partial failures. Use a tool like Chaos Mesh or Gremlin to simulate:

      1. Region A Kubernetes master down
      2. Database primary crash
      3. Network partition between two regions
    • Verify that your fallback logic works: if the leader Postgres in us-east goes down, us-west promotes its replica to become the new primary (and DNS is updated within 60 seconds).
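
The region-aware routing mentioned in the data-residency bullet can be sketched with SQLAlchemy and two regional engines; the DSNs and region codes are placeholders.

    # region_routing.py: send EU rows to the EU database, everything else to the US default
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    ENGINES = {
        "eu": create_engine("postgresql://app:secret@eu-west-1-db:5432/ohwise"),
        "us": create_engine("postgresql://app:secret@us-east-1-db:5432/ohwise"),
    }
    SESSIONS = {region: sessionmaker(bind=engine) for region, engine in ENGINES.items()}

    def session_for_region(region: str):
        """Return a session bound to the correct regional database; default to US."""
        return SESSIONS.get(region, SESSIONS["us"])()

    # Example: persisting an EU user's profile never leaves eu-west-1
    # session = session_for_region(user.region)
    # session.add(profile_row); session.commit()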


4. Conclusion: Evolving the Architecture

No single blueprint fits all stages. At DAU < 10 000, the imperative is “code well, test well, keep it simple.” As traffic climbs past 100 000, you break apart monoliths into microservices, introduce caching, message queues and distributed databases. Once you exceed 1 000 000 daily active users—especially across multiple continents—you implement multi-region Kubernetes clusters, geo-load balancing, sharded vector stores and OLTP databases, a service mesh for security/observability, and robust disaster recovery processes.

At every stage, however, one truth remains: scalability is not just about adding servers; it is about making each layer replaceable without rewriting everything above or below it. Maintainability follows naturally when you avoid tight coupling, keep code responsibilities focused, and build straightforward CI/CD pipelines that catch errors early. By treating each layer—code, service, data store, and infrastructure—as a component you can swap at will, you remain agile even when serving millions of users.

In short:

  1. Early Stage: Clean architecture, single responsibility, basic CI/CD, observability.
  2. Intermediate Stage: Service partitioning, clustered data stores, Redis caching, message queues, fine-tuned resource management.
  3. Large-Scale Stage: Multi-region deployments, geo-routing, sharded vector and OLTP clusters, service mesh, global caching, compliance and disaster recovery.

By methodically evolving your platform along these three phases—always keeping the next step in mind—you build a system that remains reliable, performant and maintainable from day one up to the point where you serve millions, globally.
