
Temporal + LangGraph: A Two-Layer Architecture for Multi-Agent Coordination


https://www.anup.io/temporal-langgraph-a-two-layer-architecture-for-multi-agent-coordination/


The Two-Layer Architecture

Keeping these layers separate turned out to be critical.

Temporal: The Orchestration Layer

Temporal is a workflow engine. You write workflows as code. Temporal handles persisting state between steps, retrying failed operations with backoff, enforcing timeouts, and letting you query what's happening. If your worker crashes mid-workflow, Temporal picks up where it left off.
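To make "retrying with backoff" concrete, here is a minimal sketch of how an exponential backoff schedule can be computed. The helper `backoff_delays` and its parameter names are hypothetical, chosen to resemble the knobs a retry policy usually exposes (initial interval, growth coefficient, cap, attempt limit). It is not Temporal's implementation.

```python
from datetime import timedelta
from typing import List


def backoff_delays(
    initial: timedelta,
    coefficient: float,
    maximum: timedelta,
    max_attempts: int,
) -> List[timedelta]:
    """Return the wait before each retry (hypothetical helper).

    The delay grows geometrically and is capped at `maximum`.
    With max_attempts tries in total there are max_attempts - 1 retries.
    """
    delays = []
    for retry in range(max_attempts - 1):
        delay = initial * (coefficient ** retry)
        delays.append(min(delay, maximum))
    return delays


schedule = backoff_delays(
    initial=timedelta(seconds=1),
    coefficient=2.0,
    maximum=timedelta(seconds=10),
    max_attempts=6,
)
print([d.total_seconds() for d in schedule])  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

The cap matters for LLM calls: without it, a long outage would push the wait between attempts to impractical lengths.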

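Think of Temporal workflows as deterministic replay machines. That framing helped me understand them. Temporal doesn't just run your code once and hope for the best. It records the result of each step in an event history, and after a crash it re-executes the workflow code against that history, reusing the recorded results instead of repeating the work. That's why workflows are good for orchestrating unreliable things like LLM calls.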
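The replay idea can be illustrated with a toy model in plain Python. Everything here (`ReplayContext`, `step`, the `workflow` function) is a hypothetical stand-in written for this note, not Temporal's API. Results of completed steps are stored in a history, and a restarted run reads them back instead of executing the step again.

```python
from typing import Any, Callable, List, Optional


class ReplayContext:
    """Toy stand-in for a durable event history (not Temporal's API)."""

    def __init__(self, history: Optional[List[Any]] = None) -> None:
        self.history: List[Any] = list(history or [])
        self._cursor = 0      # position in the recorded history
        self.executed = 0     # how many steps actually ran this time

    def step(self, fn: Callable[[], Any]) -> Any:
        """Return the recorded result if one exists, else run and record."""
        if self._cursor < len(self.history):
            result = self.history[self._cursor]   # replay: do not re-run
        else:
            result = fn()                         # first time: run it
            self.history.append(result)
            self.executed += 1
        self._cursor += 1
        return result


def workflow(ctx: ReplayContext) -> str:
    # Deterministic coordination: same steps in the same order on every run.
    draft = ctx.step(lambda: "draft from model")
    review = ctx.step(lambda: "review of " + draft)
    return review


first = ReplayContext()
workflow(first)

# Simulate a crash after the first step by keeping only one recorded event.
resumed = ReplayContext(history=first.history[:1])
result = workflow(resumed)
print(result, resumed.executed)  # review of draft from model 1
```

The resumed run executes only the second step. This only works because `workflow` asks for the same steps in the same order each time, which is the determinism requirement in miniature.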

LangGraph: The Agent Layer

LangGraph gives you state machines for agent logic. You define nodes (functions), edges (transitions), and a state schema. It handles running nodes in parallel when you want that. It's built around the idea that agents accumulate state as they work.

I use TypedDict for state schemas. You get autocomplete and it catches typos. More on that later.
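As a rough illustration of state that accumulates, the sketch below defines a `TypedDict` schema and merges partial updates returned by node functions. The `Annotated[..., operator.add]` field follows, as far as I understand it, the reducer convention LangGraph uses for state fields. The merge function `apply_update`, the `ResearchState` schema, and the two nodes are hypothetical and hand-written so the example runs without LangGraph installed.

```python
import operator
from typing import Annotated, Any, Dict, List, TypedDict, get_type_hints


class ResearchState(TypedDict):
    question: str
    # The Annotated metadata names the function used to merge updates.
    findings: Annotated[List[str], operator.add]


def apply_update(state: ResearchState, update: Dict[str, Any]) -> ResearchState:
    """Merge a node's partial update into the state (toy merge logic)."""
    hints = get_type_hints(ResearchState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducer = getattr(hints[key], "__metadata__", (None,))[0]
        # Fields with a reducer accumulate; plain fields are overwritten.
        merged[key] = reducer(state[key], value) if reducer else value
    return merged  # type: ignore[return-value]


def search_node(state: ResearchState) -> Dict[str, Any]:
    return {"findings": [f"source for: {state['question']}"]}


def summarize_node(state: ResearchState) -> Dict[str, Any]:
    return {"findings": [f"summary of {len(state['findings'])} finding(s)"]}


state: ResearchState = {"question": "What is Temporal?", "findings": []}
for node in (search_node, summarize_node):
    state = apply_update(state, node(state))
print(state["findings"])
```

A type checker would flag a misspelled key such as `state["finding"]`, which is the typo-catching benefit mentioned above.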

Why Keep Them Separate?

Temporal answers "did this complete, and if not, what do we do about it?" LangGraph answers "given this state, what should the agent do next?" These are different questions.

I tried combining them early on. It got messy fast. Retry logic bled into agent logic. I couldn't tell where state lived. Separating them cleaned everything up.
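The separation can be sketched in a few lines of plain Python. All names here (`agent_step`, `run_with_retries`, `flaky_agent_step`) are hypothetical. In a real setup the retry wrapper would be Temporal's job and the step function would be a LangGraph graph. The point is only that the agent function contains no retry logic and the retry wrapper knows nothing about agent state.

```python
import time
from typing import Callable, Dict


def agent_step(state: Dict[str, object]) -> Dict[str, object]:
    """Agent layer: given this state, decide what happens next."""
    if not state.get("plan"):
        return {**state, "plan": "search then summarize"}
    return {**state, "done": True}


def run_with_retries(
    step: Callable[[], Dict[str, object]],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Dict[str, object]:
    """Orchestration layer: did it complete, and if not, what do we do?"""
    last_error = None
    for _ in range(max_attempts):
        try:
            return step()
        except ConnectionError as exc:
            last_error = exc
            time.sleep(delay_seconds)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


calls = {"count": 0}


def flaky_agent_step() -> Dict[str, object]:
    # Fail once to simulate an LLM outage, then delegate to the agent logic.
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated LLM outage")
    return agent_step({"question": "q"})


outcome = run_with_retries(flaky_agent_step)
print(outcome["plan"], calls["count"])  # search then summarize 2
```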

But wait, doesn't LangGraph have its own durable execution now?

Yes. LangGraph 1.0 (released October 2025) includes built-in persistence and durable execution. So why use Temporal?

Temporal is battle-tested across thousands of production deployments for mission-critical workflows. It gives you superior observability through the Temporal Web UI, native support for workflows spanning days or weeks, and a proven track record handling infrastructure failures. LangGraph's durable execution is newer and purpose-built for AI agents, but Temporal's maturity matters when you need rock-solid reliability.

A Grid Dynamics case study supports this: they migrated from a LangGraph-only setup to Temporal after finding that LangGraph's Redis-based state management created issues with lifecycle management and debugging. Temporal's event history made state persistence automatic and debugging straightforward.


Temporal and OpenAI Launch AI Agent Durability with Public Preview Integration

https://www.infoq.com/news/2025/09/temporal-aiagent/

https://github.com/temporalio/sdk-python/blob/main/temporalio/contrib/openai_agents/README.md

Temporal has unveiled a public preview integration with the OpenAI Agents SDK, introducing durable execution capabilities to AI agent workflows built using OpenAI's framework. This collaboration enables developers to build AI agents that automatically handle real-world operational challenges, such as LLM rate limits, network disruptions, and unexpected crashes, without adding complexity to their code.

At the core of this integration is Temporal’s strength in orchestrating distributed, fault-tolerant systems. OpenAI agents, when wrapped in Temporal workflows, benefit from built-in retry logic, state persistence, and crash recovery, allowing developers to define the "happy path" and rely on Temporal to manage error handling and workflow consistency.

Traditionally, AI agents, whether built with LangChain, LlamaIndex, or the OpenAI SDK, run as stateless processes, meaning a failure mid-execution forces a complete restart and wastes compute and token costs. With Temporal, every agent interaction, including large language model (LLM) calls, tool executions, and external API requests, is captured as part of a deterministic workflow. This approach allows the system to automatically replay and restore the agent’s exact state after a crash, timeout, or network failure, dramatically increasing reliability and operational efficiency.

 

https://github.com/temporalio/sdk-python/blob/main/temporalio/contrib/openai_agents/README.md

In Temporal's durable execution implementation, a program that crashes or encounters an exception while interacting with a model or API will retry until it can successfully complete.

Temporal relies primarily on a replay mechanism to recover from failures. As the program makes progress, Temporal saves key inputs and decisions, allowing a re-started program to pick up right where it left off.

The key to making this work is to separate the application's repeatable (deterministic) and non-repeatable (non-deterministic) parts:

  1. Deterministic pieces, termed workflows, execute the same way when re-run with the same inputs.
  2. Non-deterministic pieces, termed activities, can run arbitrary code, performing I/O and any other operations.

Workflow code can run for extended periods and, if interrupted, resume exactly where it left off. Activity code faces no restrictions on I/O or external interactions, but if it fails part-way through it restarts from the beginning.

In the AI-agent example below, model invocations and tool calls run inside activities, while the logic that coordinates them lives in the workflow. This pattern generalizes to more sophisticated agents. We refer to that coordinating logic as agent orchestration.

As a general rule, agent orchestration code executes within the Temporal workflow, whereas model calls and any I/O-bound tool invocations execute as Temporal activities.
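The rule can be pictured as a loop that only makes decisions and delegates every side effect to a separately defined function. The sketch below uses plain `asyncio` coroutines as stand-ins for activities. `call_model`, `get_weather`, `TOOLS`, and `agent_loop` are hypothetical names for this illustration, and the fake model is scripted so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


# Stand-ins for activities: in a real system these would do network I/O.
async def call_model(messages: List[str]) -> Dict[str, str]:
    """Fake model: ask for the weather tool once, then answer."""
    if not any(m.startswith("tool:") for m in messages):
        return {"type": "tool_call", "tool": "get_weather", "arg": "Tokyo"}
    return {"type": "final", "text": "Answer based on " + messages[-1]}


async def get_weather(city: str) -> str:
    return f"Sunny in {city}"


TOOLS: Dict[str, Callable[[str], Awaitable[str]]] = {"get_weather": get_weather}


async def agent_loop(question: str, max_turns: int = 5) -> str:
    """Orchestration: no I/O here, only decisions about what to call next."""
    messages = [question]
    for _ in range(max_turns):
        reply = await call_model(messages)                      # activity-style call
        if reply["type"] == "final":
            return reply["text"]
        tool_result = await TOOLS[reply["tool"]](reply["arg"])  # activity-style call
        messages.append("tool:" + tool_result)
    raise RuntimeError("agent did not finish within max_turns")


answer = asyncio.run(agent_loop("What is the weather in Tokyo?"))
print(answer)  # Answer based on tool:Sunny in Tokyo
```

Because the loop itself touches nothing external, re-running it with the same recorded replies takes the same path, which is what makes it a candidate for workflow code.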

The diagram below shows the overall architecture of an agentic application in Temporal. The Temporal Server is responsible for tracking program execution and making sure associated state is preserved reliably (i.e., stored to a database, possibly replicated across cloud regions). Temporal Server manages data in encrypted form, so all data processing occurs on the Worker, which runs the workflow and activities.

            +---------------------+
            |   Temporal Server   |      (Stores workflow state,
            +---------------------+       schedules activities,
                     ^                    persists progress)
                     |
        Save state,  |   Schedule Tasks,
        progress,    |   load state on resume
        timeouts     |
                     |
+------------------------------------------------------+
|                      Worker                          |
|   +----------------------------------------------+   |
|   |              Workflow Code                   |   |
|   |       (Agent Orchestration Loop)             |   |
|   +----------------------------------------------+   |
|          |          |                |               |
|          v          v                v               |
|   +-----------+ +-----------+ +-------------+        |
|   | Activity  | | Activity  | |  Activity   |        |
|   | (Tool 1)  | | (Tool 2)  | | (Model API) |        |
|   +-----------+ +-----------+ +-------------+        |
|         |           |                |               |
+------------------------------------------------------+
          |           |                |
          v           v                v
      [External APIs, services, databases, etc.]

 

from dataclasses import dataclass
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.contrib import openai_agents
from agents import Agent, Runner

@dataclass
class Weather:
    city: str
    temperature_range: str
    conditions: str

@activity.defn
async def get_weather(city: str) -> Weather:
    """Get the weather for a given city."""
    # Activity: free to do I/O. This sample returns fixed data instead of calling an API.
    return Weather(city=city, temperature_range="14-20C", conditions="Sunny with wind.")

@workflow.defn
class WeatherAgent:
    @workflow.run
    async def run(self, question: str) -> str:
        # Workflow: orchestration only. The agent is built and run inside the workflow.
        agent = Agent(
            name="Weather Assistant",
            instructions="You are a helpful weather agent.",
            tools=[
                # Expose the activity to the agent as a tool, with a per-call timeout.
                openai_agents.workflow.activity_as_tool(
                    get_weather,
                    start_to_close_timeout=timedelta(seconds=10)
                )
            ],
        )
        result = await Runner.run(starting_agent=agent, input=question)
        return result.final_output

 

https://code2life.top/blog/0070-temporal-notes

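What is the essence of orchestration?

To understand orchestration, it helps to look at the concept that contrasts with it: choreography. I couldn't find a good Chinese translation for the term, so it is easier to understand from a diagram:

For example, when building microservices we often use a message queue (MQ) for event-driven business logic, implementing eventually consistent data flows across multiple services. That is choreography. Once an MQ is introduced, though, a series of problems can follow: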

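  • Message ordering
  • Retries and idempotency
  • Tracing events and messages across the call chain
  • Business logic scattered across too many places
  • Reconciling and correcting data that has already become inconsistent
  • ...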

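In a complex microservice system an MQ is a very useful component, but it is no silver bullet, as anyone who has lived through these problems will know. If you lean too heavily on MQ-style event-driven designs without a sufficiently strong message governance scheme, the whole distributed system becomes noisy and hard to maintain.

What if we change approach and appoint a single "scheduling authority", a "conductor" that controls how every message flows? That is exactly what orchestration means.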

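  • Choreography has an unbounded context and is decentralized: each component only cares about and publishes its own events, everything is asynchronous, and the emphasis is on decoupling.
  • Orchestration has a bounded context and a global orchestrator: the whole process is modeled as a state machine, and the emphasis is on cohesion.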
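The contrast can be sketched with a toy order flow in plain Python. The event names, handlers, and `order_workflow` are hypothetical. In the choreography half no single function describes the whole flow, while in the orchestration half one function does.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

log: List[str] = []

# --- Choreography: components only react to events they subscribe to. ---
subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)


def publish(event: str, order_id: str) -> None:
    for handler in subscribers[event]:
        handler(order_id)


def charge_on_order(order_id: str) -> None:
    log.append(f"choreography: charged {order_id}")
    publish("payment_done", order_id)   # the next step is triggered implicitly


def ship_on_payment(order_id: str) -> None:
    log.append(f"choreography: shipped {order_id}")


subscribers["order_created"].append(charge_on_order)
subscribers["payment_done"].append(ship_on_payment)
publish("order_created", "A1")


# --- Orchestration: one coordinator owns the whole flow. ---
def charge(order_id: str) -> None:
    log.append(f"orchestration: charged {order_id}")


def ship(order_id: str) -> None:
    log.append(f"orchestration: shipped {order_id}")


def order_workflow(order_id: str) -> None:
    # The full sequence is readable in one place.
    charge(order_id)
    ship(order_id)


order_workflow("B2")
print(log)
```

Adding a compensation step, such as a refund when shipping fails, means touching several handlers in the first style and one function in the second.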

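All of Temporal's use cases are orchestration scenarios with a global context and high cohesion. For example, BPM has an explicit flowchart, DevOps and big-data pipelines have an explicit DAG, and long-lived transactions have explicit execution and compensation flows.

Temporal lets us write workflow code the way we write ordinary code, but that code does not necessarily execute on the local machine: which line yields at which moment is controlled centrally by signaling from the server. Many distributed-system resilience problems are also encapsulated away, such as distributed locks, retries that fail because a machine went down, data errors caused by stale retries, and timing gaps when handling concurrent messages.

Key Temporal Concepts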

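  1. Workflow: the key concept at the orchestration layer. Each kind of workflow is a WorkflowType registered with the server, and each WorkflowType can create any number of running instances, called WorkflowExecutions. Each execution has a unique WorkflowID, and with Cron or Continue-as-New each run also has a unique RunID. A Workflow can contain loops and can nest child workflows (ChildWorkflow).
  2. Activity: Activities are the main thing a Workflow orchestrates. Orchestrating Activities is just like writing ordinary code: you can call Activity methods from if, for, or even while(true) constructs, as long as the logic stays deterministic.
  3. Signal: a running WorkflowExecution can be sent signals that carry parameters. The Workflow can wait for signals or handle them conditionally, which dynamically controls its execution logic.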
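The Signal concept can be modeled with a small `asyncio` toy: a running "workflow" waits until an external caller delivers a signal with a parameter. `ApprovalWorkflow` and its methods are hypothetical and deliberately avoid the Temporal SDK, where this role is played by `@workflow.signal` handlers together with `workflow.wait_condition`.

```python
import asyncio
from typing import Optional


class ApprovalWorkflow:
    """Toy model of a workflow that pauses until it receives a signal."""

    def __init__(self) -> None:
        self._approved: Optional[bool] = None
        self._changed = asyncio.Event()

    def signal_approval(self, approved: bool) -> None:
        # A signal carries a parameter and updates workflow state.
        self._approved = approved
        self._changed.set()

    async def run(self, request: str) -> str:
        await self._changed.wait()   # block until a signal arrives
        return f"{request}: {'approved' if self._approved else 'rejected'}"


async def main() -> str:
    wf = ApprovalWorkflow()
    task = asyncio.create_task(wf.run("deploy v2"))
    await asyncio.sleep(0)           # let the workflow start waiting
    wf.signal_approval(True)
    return await task


result = asyncio.run(main())
print(result)  # deploy v2: approved
```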

The figure below shows an example of a Workflow's execution details in the Temporal Dashboard.

 

 

POC: Temporal + LangGraph Integration

https://github.com/fanqingsong/temporal-langgraph-poc

Candidate: Deep Research Agent Workflow

A proof-of-concept research assistant that demonstrates how to combine Temporal workflow orchestration with LangGraph's intelligent graph-based workflow management for building robust AI agent workflows.

Motivation

Building production AI agent workflows requires both intelligent decision-making and distributed reliability:

  • LangGraph: Excellent for AI agent logic, but its core is not distributed; it runs in a single process without built-in scaling
  • Temporal: World-class distributed orchestration that can make any workflow distributed

This POC shows how Temporal distributes LangGraph workflows to achieve intelligent AND scalable AI systems:

🤖 LangGraph (Intelligence)  +  ⚡ Temporal (Distribution)  =  🚀 Scalable AI Workflows
 

Architecture

🔄 Integration Pattern

Temporal Level: Orchestrates three main activities with durable execution, fault tolerance, and parallelization

LangGraph Level: Each Temporal activity contains its own StateGraph with nodes, edges, and conditional logic
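The two levels can be sketched without either library. In this toy version the outer coroutine fans out several "activities" in parallel, and each activity runs its own small chain of nodes over a state dictionary. The topic names, `run_graph`, `research_activity`, and `research_workflow` are hypothetical. The POC's actual three activities and graphs are in the repository linked above.

```python
import asyncio
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]


def run_graph(nodes: List[Node], state: State) -> State:
    """Inner layer: a linear stand-in for a state graph."""
    for node in nodes:
        state = {**state, **node(state)}   # each node returns a partial update
    return state


async def research_activity(topic: str) -> State:
    """One 'activity' that owns its own small graph."""
    nodes: List[Node] = [
        lambda s: {"notes": f"notes on {s['topic']}"},
        lambda s: {"summary": f"summary of {s['notes']}"},
    ]
    return run_graph(nodes, {"topic": topic})


async def research_workflow(topics: List[str]) -> List[str]:
    """Outer layer: fan out the activities and gather results in order."""
    results = await asyncio.gather(*(research_activity(t) for t in topics))
    return [str(r["summary"]) for r in results]


summaries = asyncio.run(research_workflow(["Temporal", "LangGraph", "agents"]))
print(summaries)
```

Keeping each graph inside one activity means a failure retries that graph as a unit, while the outer layer decides how many run at once and what happens when one gives up.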

 
 

 

https://github.com/fanqingsong/pydantic-ai-demos/tree/main

https://github.com/fanqingsong/temporal-data-pipeline-demo

https://github.com/fanqingsong/temporal-deep-research-demo

https://github.com/fanqingsong/temporal-ai-agent

https://github.com/temporal-community/openai-agents-demos

samples-python/langchain at main · temporalio/samples-python · GitHub

 

Temporal Parallel Child Workflows

https://www.danielcorin.com/til/temporal/parallel-child-workflows/

https://docs.temporal.io/develop/python/child-workflows

 

posted @ 2026-01-25 20:32  lightsong