Temporal + LangGraph: A Two-Layer Architecture for Multi-Agent Coordination
https://www.anup.io/temporal-langgraph-a-two-layer-architecture-for-multi-agent-coordination/

The Two-Layer Architecture
Keeping these layers separate turned out to be critical.
Temporal: The Orchestration Layer
Temporal is a workflow engine. You write workflows as code. Temporal handles persisting state between steps, retrying failed operations with backoff, enforcing timeouts, and letting you query what's happening. If your worker crashes mid-workflow, Temporal picks up where it left off.
Think of Temporal workflows as deterministic replay machines. That framing helped me understand them. Temporal records each completed step in an event history, and after a crash it re-runs the workflow code against that history, so finished steps return their recorded results instead of executing again. That's why they're good for orchestrating unreliable things like LLM calls.
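The replay idea can be sketched in plain Python. This is a toy model, not the Temporal SDK: `ReplayLog`, `run_step`, `fake_llm`, and `flaky_save` are hypothetical names, and the "crash" is simulated with an exception. It shows only the core mechanic: completed steps are recorded, and a re-run reads them back instead of executing them again.

```python
from typing import Any, Callable


class ReplayLog:
    """Toy event history: one recorded result per completed step, in call order."""

    def __init__(self) -> None:
        self.events: list[Any] = []

    def runner(self) -> Callable[[Callable[[], Any]], Any]:
        """Return a fresh step runner that replays recorded results first."""
        position = 0

        def run_step(step: Callable[[], Any]) -> Any:
            nonlocal position
            if position < len(self.events):
                result = self.events[position]  # replay: do not re-execute
            else:
                result = step()  # first execution: run, then record
                self.events.append(result)
            position += 1
            return result

        return run_step


calls = {"llm": 0, "crash_once": True}


def fake_llm() -> str:
    """Stand-in for an expensive model call (hypothetical)."""
    calls["llm"] += 1
    return "draft answer"


def flaky_save() -> str:
    """Fails the first time it runs to simulate a worker crash."""
    if calls["crash_once"]:
        calls["crash_once"] = False
        raise RuntimeError("worker crashed")
    return "saved"


def workflow(run_step: Callable[[Callable[[], Any]], Any]) -> str:
    draft = run_step(fake_llm)
    status = run_step(flaky_save)
    return f"{draft} / {status}"


log = ReplayLog()
try:
    workflow(log.runner())  # first attempt crashes in the second step
except RuntimeError:
    pass
result = workflow(log.runner())  # second attempt replays step one from the log
```

After the second run, `calls["llm"]` is still 1: the model call was not repeated, which is the property that saves tokens when an agent crashes halfway through.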
LangGraph: The Agent Layer
LangGraph gives you state machines for agent logic. You define nodes (functions), edges (transitions), and a state schema. It handles running nodes in parallel when you want that. It's built around the idea that agents accumulate state as they work.
I use TypedDict for state schemas. You get autocomplete and it catches typos. More on that later.
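To make the nodes, edges, and state schema concrete, here is a minimal sketch in plain Python. It deliberately does not use the LangGraph API; `ResearchState`, `NODES`, `EDGES`, and `run_graph` are hypothetical names that only mimic the shape of the idea: each node receives the state, returns a partial update, and the runner merges that update before following the edge to the next node.

```python
from typing import Callable, TypedDict


class ResearchState(TypedDict, total=False):
    """State schema: the keys an agent may accumulate as it works."""
    question: str
    notes: list[str]
    answer: str


Node = Callable[[ResearchState], ResearchState]


def search(state: ResearchState) -> ResearchState:
    # A node returns only the keys it wants to update.
    return {"notes": [f"note about {state['question']}"]}


def summarize(state: ResearchState) -> ResearchState:
    return {"answer": f"summary of {len(state['notes'])} note(s)"}


NODES: dict[str, Node] = {"search": search, "summarize": summarize}
EDGES: dict[str, str] = {"search": "summarize", "summarize": "END"}


def run_graph(entry: str, state: ResearchState) -> ResearchState:
    """Walk the edges from `entry`, merging each node's update into the state."""
    current = entry
    while current != "END":
        update = NODES[current](state)
        state = {**state, **update}
        current = EDGES[current]
    return state


final = run_graph("search", {"question": "durable agents"})
```

With a TypedDict schema, a type checker flags a misspelled key such as `state["anwser"]` before the graph ever runs.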
Why Keep Them Separate?

Temporal answers "did this complete, and if not, what do we do about it?" LangGraph answers "given this state, what should the agent do next?" These are different questions.
I tried combining them early on. It got messy fast. Retry logic bled into agent logic. I couldn't tell where state lived. Separating them cleaned everything up.
But wait, doesn't LangGraph have its own durable execution now?
Yes. LangGraph 1.0 (released October 2025) includes built-in persistence and durable execution. So why use Temporal?
Temporal is battle-tested across thousands of production deployments for mission-critical workflows. It gives you superior observability through the Temporal Web UI, native support for workflows spanning days or weeks, and a proven track record handling infrastructure failures. LangGraph's durable execution is newer and purpose-built for AI agents, but Temporal's maturity matters when you need rock-solid reliability.
A Grid Dynamics case study validates this: they migrated from LangGraph-only to Temporal after finding that LangGraph's Redis-based state management created issues with lifecycle management and debugging. Temporal's event history made state persistence automatic and debugging straightforward.

Temporal and OpenAI Launch AI Agent Durability with Public Preview Integration
https://www.infoq.com/news/2025/09/temporal-aiagent/
https://github.com/temporalio/sdk-python/blob/main/temporalio/contrib/openai_agents/README.md
Temporal has unveiled a public preview integration with the OpenAI Agents SDK, introducing durable execution capabilities to AI agent workflows built using OpenAI's framework. This collaboration enables developers to build AI agents that automatically handle real-world operational challenges, such as LLM rate limits, network disruptions, and unexpected crashes, without adding complexity to their code.
At the core of this integration is Temporal’s strength in orchestrating distributed, fault-tolerant systems. OpenAI agents, when wrapped in Temporal workflows, benefit from built-in retry logic, state persistence, and crash recovery, allowing developers to define the "happy path" and rely on Temporal to manage error handling and workflow consistency.
Traditionally, AI agents, whether built with LangChain, LlamaIndex, or the OpenAI SDK, run as stateless processes, meaning a failure mid-execution forces a complete restart and wastes compute and token costs. With Temporal, every agent interaction, including large language model (LLM) calls, tool executions, and external API requests, is captured as part of a deterministic workflow. This approach allows the system to automatically replay and restore the agent’s exact state after a crash, timeout, or network failure, dramatically increasing reliability and operational efficiency.
https://github.com/temporalio/sdk-python/blob/main/temporalio/contrib/openai_agents/README.md
In Temporal's durable execution implementation, a program that crashes or encounters an exception while interacting with a model or API will retry until it can successfully complete.
Temporal relies primarily on a replay mechanism to recover from failures. As the program makes progress, Temporal saves key inputs and decisions, allowing a re-started program to pick up right where it left off.
The key to making this work is to separate the application's repeatable (deterministic) and non-repeatable (non-deterministic) parts:
- Deterministic pieces, termed workflows, execute the same way when re-run with the same inputs.
- Non-deterministic pieces, termed activities, can run arbitrary code, performing I/O and any other operations.
Workflow code can run for extended periods and, if interrupted, resume exactly where it left off. Activity code faces no restrictions on I/O or external interactions, but if it fails part-way through it restarts from the beginning.
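The activity behaviour described above (a failed activity restarts from the beginning and is retried) can be sketched with asyncio. This is an illustration under stated assumptions, not Temporal's implementation: `run_activity`, its parameters, and `call_model` are hypothetical, and the retry schedule is a simple exponential backoff with a per-attempt timeout.

```python
import asyncio


async def run_activity(activity, *, attempts=4, timeout=1.0,
                       initial_delay=0.01, backoff=2.0):
    """Re-run `activity` from the start until it succeeds or attempts run out."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            # Each attempt gets its own timeout, like a per-attempt deadline.
            return await asyncio.wait_for(activity(), timeout=timeout)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay)
            delay *= backoff  # wait longer before the next attempt


attempt_log = []


async def call_model():
    """Stand-in for a model call that is rate limited twice (hypothetical)."""
    attempt_log.append("start")
    if len(attempt_log) < 3:
        raise ConnectionError("rate limited")
    return "model reply"


reply = asyncio.run(run_activity(call_model))
```

Every attempt starts `call_model` from its first line, which is why activities that have side effects are usually written to be idempotent.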
In the AI-agent example below, model invocations and tool calls run inside activities, while the logic that coordinates them lives in the workflow. This pattern generalizes to more sophisticated agents. We refer to that coordinating logic as agent orchestration.
As a general rule, agent orchestration code executes within the Temporal workflow, whereas model calls and any I/O-bound tool invocations execute as Temporal activities.
The diagram below shows the overall architecture of an agentic application in Temporal. The Temporal Server is responsible for tracking program execution and making sure the associated state is preserved reliably (i.e., stored in a database, possibly replicated across cloud regions). Temporal Server manages data in encrypted form, so all data processing occurs on the Worker, which runs the workflow and activities.
            +---------------------+
            |   Temporal Server   |      (Stores workflow state,
            +---------------------+       schedules activities,
                       ^                  persists progress)
                       |
          Save state,  |   Schedule Tasks,
          progress,    |   load state on resume
          timeouts     |
                       |
+------------------------------------------------------+
|                      Worker                          |
|   +----------------------------------------------+   |
|   |              Workflow Code                   |   |
|   |       (Agent Orchestration Loop)             |   |
|   +----------------------------------------------+   |
|         |             |              |               |
|         v             v              v               |
|   +-----------+ +-----------+ +-------------+        |
|   | Activity  | | Activity  | |  Activity   |        |
|   | (Tool 1)  | | (Tool 2)  | | (Model API) |        |
|   +-----------+ +-----------+ +-------------+        |
|         |             |              |               |
+------------------------------------------------------+
          |             |              |
          v             v              v
      [External APIs, services, databases, etc.]
from dataclasses import dataclass
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.contrib import openai_agents
from agents import Agent, Runner


@dataclass
class Weather:
    city: str
    temperature_range: str
    conditions: str


@activity.defn
async def get_weather(city: str) -> Weather:
    """Get the weather for a given city."""
    return Weather(city=city, temperature_range="14-20C", conditions="Sunny with wind.")


@workflow.defn
class WeatherAgent:
    @workflow.run
    async def run(self, question: str) -> str:
        agent = Agent(
            name="Weather Assistant",
            instructions="You are a helpful weather agent.",
            tools=[
                openai_agents.workflow.activity_as_tool(
                    get_weather, start_to_close_timeout=timedelta(seconds=10)
                )
            ],
        )
        result = await Runner.run(starting_agent=agent, input=question)
        return result.final_output
https://code2life.top/blog/0070-temporal-notes
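What Is the Essence of Orchestration?
To understand orchestration, it helps to look at its counterpart: choreography. The term does not translate neatly, so it is easier to understand from a diagram: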

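For example, when building microservices we often use a message queue (MQ) for event-driven business logic, implementing eventually consistent data flows that span multiple services. That is choreography. Once an MQ is introduced, however, a series of problems can follow:
- Message ordering
- Retries and idempotency
- Tracing events and messages end to end
- Business logic scattered across too many places
- Reconciling and correcting data that has already become inconsistent
- ...
In a complex microservice system an MQ is a very useful component, but it is no silver bullet, as anyone who has lived through these problems knows. If a system leans too heavily on MQ-style event-driven designs without sufficiently strong message governance, the whole distributed system becomes noisy and hard to maintain.
What if we change approach and appoint a "scheduling authority", a "conductor" that controls how every message flows? That is exactly what orchestration means.
- Choreography has an unbounded context and is decentralized: each component only cares about and publishes its own events, everything is asynchronous, and the emphasis is on decoupling.
- Orchestration has a bounded context and a global orchestrator: the whole process is modeled as a state machine, and the emphasis is on cohesion.
Every Temporal use case is an orchestration scenario with a global context and high cohesion. For example, BPM has an explicit flowchart, DevOps and big-data pipelines have an explicit DAG, and long-running transactions have explicit execution and compensation flows.
Temporal lets us write workflow code as if it were ordinary code, yet that code does not necessarily execute on the local machine. Which line yields at what time is controlled centrally by signaling from the server, and many distributed-systems resilience problems are encapsulated away: distributed locks, retries that fail because of a crash, data errors caused by retrying after expiry, timing gaps when handling concurrent messages, and so on.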
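Key Temporal Concepts
- Workflow: the key concept at the orchestration layer. Each type is registered with the server as a WorkflowType, and each WorkflowType can create any number of running instances, called WorkflowExecutions. Each execution has a unique WorkflowID; with Cron or Continue-as-New, each run also has a unique RunID. A Workflow may contain loops and may nest child workflows (ChildWorkflow).
- Activity: the main thing a Workflow orchestrates. Orchestrating Activities is just like writing ordinary code: you can call Activity methods from if, for, or even while(true) constructs, as long as the logic stays deterministic.
- Signal: a running WorkflowExecution can be sent signals that carry parameters. The Workflow can wait for signals or handle them conditionally, which lets you control its execution logic dynamically.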
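The Signal concept can be illustrated with a small asyncio sketch. It is a toy model rather than Temporal SDK code: `ApprovalWorkflow`, its `signal` method, and the signal names are hypothetical. A running workflow waits for signals that carry a payload and decides what to do based on which one arrives.

```python
import asyncio


class ApprovalWorkflow:
    """Toy workflow that pauses until it receives an 'approve' signal."""

    def __init__(self) -> None:
        self._signals: asyncio.Queue = asyncio.Queue()
        self.history: list[str] = []

    def signal(self, name: str, payload: str) -> None:
        """Deliver a signal with a parameter to the running workflow."""
        self._signals.put_nowait((name, payload))

    async def run(self) -> str:
        self.history.append("draft created")
        while True:
            name, payload = await self._signals.get()  # block until signalled
            self.history.append(f"signal:{name}")
            if name == "approve":
                return f"published by {payload}"
            # Any other signal is recorded and the workflow keeps waiting.


async def main() -> tuple[str, list[str]]:
    wf = ApprovalWorkflow()
    task = asyncio.create_task(wf.run())
    await asyncio.sleep(0)  # let the workflow start and begin waiting
    wf.signal("comment", "alice")
    wf.signal("approve", "bob")
    return await task, wf.history


outcome, history = asyncio.run(main())
```

The workflow's control flow stays ordinary code (a loop and an `if`), while the outside world steers it by sending signals.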
The figure below shows an example of a Workflow's execution details in the Temporal Dashboard.

POC: Temporal + LangGraph Integration
https://github.com/fanqingsong/temporal-langgraph-poc
A proof-of-concept research assistant that demonstrates how to combine Temporal workflow orchestration with LangGraph's intelligent graph-based workflow management for building robust AI agent workflows.
Building production AI agent workflows requires both intelligent decision-making and distributed reliability:
- LangGraph: Excellent for AI agent logic, but its core is not distributed: it runs in a single process without built-in scaling
- Temporal: World-class distributed orchestration that can make any workflow distributed
This POC shows how Temporal distributes LangGraph workflows to achieve intelligent AND scalable AI systems:
🤖 LangGraph (Intelligence) + ⚡ Temporal (Distribution) = 🚀 Scalable AI Workflows
Temporal Level: Orchestrates three main activities with durable execution, fault tolerance, and parallelization
LangGraph Level: Each Temporal activity contains its own StateGraph with nodes, edges, and conditional logic
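The two levels can be sketched in plain Python to show how they nest. This is not the POC's code and uses neither the Temporal nor the LangGraph API; `research_workflow`, `research_activity`, `run_agent_graph`, and the node functions are hypothetical, and the inner graph is a fixed two-node chain with no conditional edges. The outer function fans out one activity per topic, and each activity runs a complete inner graph.

```python
import asyncio
from typing import TypedDict


class TopicState(TypedDict, total=False):
    topic: str
    findings: str
    report: str


def gather_node(state: TopicState) -> TopicState:
    return {"findings": f"facts about {state['topic']}"}


def write_node(state: TopicState) -> TopicState:
    return {"report": f"report on {state['topic']}: {state['findings']}"}


def run_agent_graph(topic: str) -> str:
    """Inner layer: a small graph that decides what the agent does next."""
    state: TopicState = {"topic": topic}
    for node in (gather_node, write_node):
        state = {**state, **node(state)}  # merge each node's partial update
    return state["report"]


async def research_activity(topic: str) -> str:
    """Outer layer's unit of work: one activity wraps one whole graph run."""
    return await asyncio.to_thread(run_agent_graph, topic)


async def research_workflow(topics: list[str]) -> list[str]:
    """Outer layer: start one activity per topic and wait for all of them."""
    return await asyncio.gather(*(research_activity(t) for t in topics))


reports = asyncio.run(research_workflow(["retries", "timeouts", "replay"]))
```

Because the graph run is hidden behind a single activity boundary, retry and timeout policy can be attached at that boundary without touching any node logic.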
https://github.com/fanqingsong/pydantic-ai-demos/tree/main
https://github.com/fanqingsong/temporal-data-pipeline-demo
https://github.com/fanqingsong/temporal-deep-research-demo
https://github.com/fanqingsong/temporal-ai-agent
https://github.com/temporal-community/openai-agents-demos
samples-python/langchain at main · temporalio/samples-python · GitHub
Temporal Parallel Child Workflows
https://www.danielcorin.com/til/temporal/parallel-child-workflows/
https://docs.temporal.io/develop/python/child-workflows
