From ZooKeeper to Redis: Rethinking Agent Orchestration for AI-Native Systems

August 3, 2025

Intro: Real-time Agent Execution in an AI-Native World

We’re entering a new era of AI-native systems—multi-agent workflows, LLM feedback loops, and autonomous decision routing. If you're building tools where agents reason over graphs of tasks and evolve workflows over time (like OhWise does), you’ll inevitably face this question:

Where and how should you store, track, and coordinate the execution state of your DAG?

This post traces the evolution from ZooKeeper to Redis, contrasts their coordination models, and explains why Redis ultimately won for OhWise. We’ll also compare this model to emerging MCP-based agentic systems and show why Redis-backed DAG orchestration gives us more flexibility, speed, and agency.


The Problem: Multi-Agent Execution State Store

Let’s say your orchestrator parses a user request into this DAG:

      a
     / \
    b   c
     \ /
      d

Each node is an agent task—API call, model execution, I/O task, etc. Your backend must:

  • Track real-time status for each node (pending, running, success)
  • Enqueue ready nodes only
  • Notify frontend as each node finishes
  • Handle partial failures and restarts

This leads us to the State Store + Coordination Layer problem.
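To make that concrete, here is a minimal sketch of the per-node record such a state store has to track. The field names (and the extra failed state) are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class TaskNode:
    """One node in the execution DAG, keyed by task id."""
    task_id: str
    status: TaskStatus = TaskStatus.PENDING
    next: list[str] = field(default_factory=list)        # downstream task ids
    depends_on: list[str] = field(default_factory=list)  # upstream task ids
    updated_at: float = 0.0                              # unix time of last transition
```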


ZooKeeper: Battle-Tested for Distributed Coordination

DAG Encoding in ZK

/dag/123/a/status = success
/dag/123/b/status = running
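For a rough feel of what this looks like through kazoo, a common Python ZooKeeper client, here is a sketch; the paths mirror the example above, and the watch callback is hypothetical:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# One znode per task; the status lives in the znode's payload.
zk.create("/dag/123/a/status", b"success", makepath=True)
zk.create("/dag/123/b/status", b"running", makepath=True)

# A data watch fires whenever the payload changes, so the
# orchestrator can react without polling.
@zk.DataWatch("/dag/123/b/status")
def on_status_change(data, stat):
    if data is not None:
        print("task b is now:", data.decode())
```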

ZooKeeper has clear strengths for coordinating workers in a distributed system:

  • Watches: orchestrator reacts in real time without polling
  • Ephemeral nodes: detects dead workers via sessions
  • Consistency: strong, linearizable guarantees (via Zab protocol)

However, it also comes with real operational and performance overhead:

  • Complex to operate (quorum, leader election)
  • Throughput bottlenecks under high churn
  • Overkill unless you truly need distributed locks or coordination

ZooKeeper is recommended when your system needs leader election, fencing tokens, or shared locks. But it quickly becomes cumbersome for task graphs.


Redis as a Real-Time DAG Store

Pattern: Adjacency Map in Redis Hash

HSET dag:123 task:a '{"status":"success","next":["b","c"]}'
HSET dag:123 task:b '{"status":"pending","next":["d"]}'
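In Python with redis-py, the same pattern might look like the sketch below; the JSON payload shape follows the commands above but is otherwise an assumption:

```python
import json

import redis

r = redis.Redis(decode_responses=True)

# A worker records its result with a single atomic HSET.
r.hset("dag:123", "task:a", json.dumps({"status": "success", "next": ["b", "c"]}))
r.hset("dag:123", "task:b", json.dumps({"status": "pending", "next": ["d"]}))

# The orchestrator pulls the whole graph state in one round trip.
state = {task: json.loads(raw) for task, raw in r.hgetall("dag:123").items()}
print(state["task:a"]["status"])  # -> "success"
```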

Why Redis is Right at This Stage

  • Sub-ms atomic updates per task
  • Easy for workers to HSET results
  • Orchestrator can HGETALL in one shot to check progress
  • Low memory and ops overhead—ideal for MVPs and sub-10K user scale
  • Compatible with Redis Streams for event tracking

What You Still Have to Handle Manually

  • No built-in dependency resolution—you compute this externally (see the sketch after this list)
  • Must implement your own retry/reconcile logic

Despite this, Redis’s combination of simplicity, speed, and developer control makes it ideal for intelligent workflows.


Final Decision: Why Redis Wins Over ZooKeeper

After evaluating both systems under real-world DAG orchestration load, Redis won for OhWise because:

  • Roughly 10x lower read/write latency than ZooKeeper
  • Simpler setup and ops—no quorum, no zkCli debugging
  • Flexible schema for embedding custom node info
  • Streams + Hashes + Pub/Sub cover 90% of coordination needs
  • Easier to integrate with LLM agents, self-healing logic, and WebSocket-based UIs

ZooKeeper is designed for distributed consensus. Redis is designed for fast, reactive, real-time systems.

For AI-native DAG workflows where agents make decisions, mutate graphs, and operate asynchronously, Redis offers a cleaner and more maintainable mental model.


Comparison with MCP-Based Agent Systems

MCP (Model Context Protocol), introduced by Anthropic in 2024, allows LLMs to call tools via standard interfaces (JSON-RPC). It’s a step toward standardizing how agents talk to tools. But it’s not an orchestrator.

MCP Limitations for DAG Execution

  • No native task dependency handling or dynamic scheduling
  • No orchestration memory or state tracking
  • Doesn’t support backpressure, retries, or mutation-based DAG evolution

OhWise Advantage

  • Orchestrator has full control of DAG graph, status, and self-evolution
  • Redis for fast state transitions; MariaDB for versioned storage
  • Multiple agents can contribute to one evolving DAG over time
  • Designed for long-lived agent sessions—not stateless invocation

MCP is like an RPC bus. Redis + OhWise is like a DAG brain.


Race Conditions, Failures & Recovery

Let's consider two failure scenarios:

  • A worker dies after updating Redis but before notifying the orchestrator.
  • The orchestrator crashes before reacting to a completion event.

Solution: Use Redis Streams + Periodic Reconcile

  • Workers emit XADD task_complete events

  • Orchestrator blocks on XREADGROUP to react in near real time

  • A separate reconcile job runs every 30–60s:

    • Looks for "running" tasks with old timestamps
    • Re-enqueues or retries them

This pattern ensures fault tolerance without tight coupling.
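A rough redis-py sketch of this pattern, assuming a task_complete stream and an updated_at timestamp on each task record (both names are illustrative):

```python
import json
import time

import redis

r = redis.Redis(decode_responses=True)
STREAM, GROUP, CONSUMER = "task_complete", "orchestrator", "orch-1"


def announce_done(dag_key: str, task: str) -> None:
    """Worker side: emit a completion event right after writing its HSET result."""
    r.xadd(STREAM, {"dag": dag_key, "task": task, "status": "success"})


def consume_events() -> None:
    """Orchestrator side: block on the consumer group and react as events arrive."""
    try:
        r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except redis.exceptions.ResponseError:
        pass  # group already exists
    while True:
        for _stream, entries in r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=10, block=5000) or []:
            for entry_id, fields in entries:
                # ...mark the task done and enqueue newly-ready downstream tasks...
                r.xack(STREAM, GROUP, entry_id)


def reconcile(dag_key: str, stale_after: float = 120.0) -> None:
    """Periodic job: reset tasks stuck in 'running' so they get re-enqueued."""
    now = time.time()
    for task, raw in r.hgetall(dag_key).items():
        node = json.loads(raw)
        if node.get("status") == "running" and now - node.get("updated_at", 0) > stale_after:
            node["status"] = "pending"
            r.hset(dag_key, task, json.dumps(node))
```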


Redis vs ZooKeeper vs SQL: The Matrix

Feature              Redis            ZooKeeper    MariaDB
Read latency         Sub-ms           1–5 ms       100–300 ms
Write throughput     High             Limited      Decent
Built-in watchers    Manual polling   Yes          No
Coordination logic   DIY              Built-in     Externalized
Durability           Configurable     Strong       ACID
Operational cost     Lightweight      Complex      Commodity

Conclusion

Use Redis for real-time task graph state: fast, flexible, observable. Use MariaDB for DAG versioning and historical insight. ZooKeeper is still useful for consensus and leader election—but it’s not worth the operational cost unless you truly need it.

Want low-latency fan-out? Use Redis Pub/Sub + Socket.IO to stream back to your frontend. Want self-healing? Use a cron-based reconcile job to replay or resume stuck tasks.
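For the fan-out piece, here is a minimal redis-py Pub/Sub sketch; the dag_updates channel name is arbitrary, and the forwarding step is where your Socket.IO (or any WebSocket) server would push the update to browsers:

```python
import json

import redis

r = redis.Redis(decode_responses=True)


def publish_update(dag_id: str, task: str, status: str) -> None:
    """Orchestrator side: publish every status transition."""
    r.publish("dag_updates", json.dumps({"dag": dag_id, "task": task, "status": status}))


def relay_updates() -> None:
    """Gateway side: subscribe and forward to connected frontend clients."""
    pubsub = r.pubsub()
    pubsub.subscribe("dag_updates")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        update = json.loads(message["data"])
        # e.g. sio.emit("dag_update", update)  # hand off to your Socket.IO server here
        print("forwarding", update)
```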


The Future: Self-Evolving DAGs and Agent-Driven Orchestration

Your orchestrator shouldn't just dispatch—it should learn.

Next steps:

  • Add an LLM-powered audit agent that reads past DAGs and recommends structure improvements
  • Use embedding-based matching to route user prompts to similar past DAGs
  • Let agents propose mutations to the DAG and evaluate them over time

Soon, orchestration won’t just be a pipeline engine—it’ll be an evolving memory structure.


This is not a dev tool. It’s infrastructure for intelligent systems.

If you're building systems with autonomous agents, graph reasoning, or real-time pipelines, connect with me at heunify.com/contact.
