Temporal for Durable Workflows: How We Finally Stopped Losing Background Jobs
Background jobs that crash mid-execution lose all their state. Temporal solves this by making workflows durable state machines that survive process restarts, deploys, and outages. Here's what it looks like in TypeScript and Python.
Anurag Verma
A client comes to you with a problem: their order fulfillment pipeline occasionally drops orders. The process sends an email, charges the card, creates a shipping label, and updates inventory. If the server crashes between step two and step three, the card is charged but no label is created. Support has to manually sort it out.
The standard fix is a job queue with retry logic. You enqueue tasks, workers pick them up, and a failed task gets retried. This works until the failure mode is “the process crashed in the middle of a task” rather than “the task returned an error.” A worker that dies mid-execution leaves no record of how far it got. Depending on the queue's acknowledgement settings, the task is either lost or re-run from the beginning. In the second case the card is charged twice.
Temporal solves this at the architecture level. Instead of queuing discrete tasks, you write workflows as code. Temporal makes those workflows durable: if the process crashes at any point, the workflow resumes exactly where it left off when the worker comes back.
How Temporal Works
Temporal separates workflows from activities:
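Workflows are the orchestration logic. They define the sequence of steps, decide what happens when a step fails, and maintain state. Workflow code must be deterministic: given the same history it has to make the same decisions in the same order, so randomness, wall-clock time, and I/O either go through the SDK's workflow-safe APIs or live in activities.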
Activities are the actual side effects. Charging a card, sending an email, calling an API. Activities can do anything. They can fail, and Temporal will retry them according to a policy you define.
The Temporal server persists a full event history of every workflow. When a worker picks up an in-flight workflow after a restart, it replays that history to reconstruct the workflow's state. Your code runs again from the beginning, but Temporal intercepts every “completed activity” step and returns the recorded result. The actual side effects don’t repeat.
Workflow: ProcessOrder
  │
  ├── Activity: ValidateOrder     ← runs once, result cached
  ├── Activity: ChargeCard        ← runs once, result cached
  ├── Activity: CreateShipment    ← CRASH HERE
  │       Worker restarts
  │       Replay: ValidateOrder   → cached
  │       Replay: ChargeCard      → cached
  │       CreateShipment          → actually executes again
  ├── Activity: UpdateInventory
  └── Activity: SendConfirmation
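The card is not charged twice: the history shows that ChargeCard already completed, so replay returns its recorded result instead of calling the payment provider again. CreateShipment never recorded a completion, so it runs again; an activity that can be interrupted partway should therefore be safe to repeat.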
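To make the replay idea concrete, here is a toy model in plain TypeScript. It is not the Temporal SDK and does not reflect how the server stores history; `HistoryEvent`, `ReplayContext`, and `runActivity` are hypothetical names invented for this sketch. It only shows the core rule: a step with a recorded completion returns its recorded result, and a step without one executes for real.

```typescript
// Toy model of history-based replay. NOT the Temporal SDK: every name here
// (HistoryEvent, ReplayContext, runActivity) is hypothetical.
type HistoryEvent = { activity: string; result: unknown };

class ReplayContext {
  private cursor = 0;
  private readonly history: HistoryEvent[];

  constructor(history: HistoryEvent[]) {
    this.history = history;
  }

  // Return the recorded result if this step already completed; otherwise
  // execute the side effect and record its result.
  runActivity<T>(activity: string, fn: () => T): T {
    const recorded = this.history[this.cursor];
    if (recorded !== undefined) {
      if (recorded.activity !== activity) {
        // The code took a different path than the recorded run did.
        throw new Error(`non-deterministic: expected ${recorded.activity}, got ${activity}`);
      }
      this.cursor++;
      return recorded.result as T;
    }
    const result = fn(); // may throw, which stands in for a crash before completion
    this.history.push({ activity, result });
    this.cursor++;
    return result;
  }
}

// Counters stand in for real side effects so we can see what actually ran.
const calls = { validateOrder: 0, chargeCard: 0, createShipment: 0 };
let crashOnShipment = true;

function processOrder(ctx: ReplayContext): string {
  ctx.runActivity('validateOrder', () => { calls.validateOrder++; return true; });
  ctx.runActivity('chargeCard', () => { calls.chargeCard++; return 'pi_123'; });
  return ctx.runActivity('createShipment', () => {
    calls.createShipment++;
    if (crashOnShipment) throw new Error('worker crashed');
    return 'TRACK-1';
  });
}

const history: HistoryEvent[] = []; // in the real system the server persists this

try {
  processOrder(new ReplayContext(history)); // first attempt dies in step three
} catch {
  crashOnShipment = false; // the worker comes back up
}

const tracking = processOrder(new ReplayContext(history)); // replay from the top
console.log(tracking, calls);
```

After the second run the charge counter is still 1 while the shipment counter is 2, which matches the diagram: completed steps are served from history, and the interrupted step is attempted again.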
Getting Started: TypeScript
npm install @temporalio/client @temporalio/worker @temporalio/workflow @temporalio/activity
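Run a local Temporal server for development. With the Temporal CLI installed, one command starts a dev server on localhost:7233:
temporal server start-dev
# or with Docker: the auto-setup image expects a database, so use Temporal's docker-compose project, which provides one
git clone https://github.com/temporalio/docker-compose.git
cd docker-compose && docker compose up
# to scaffold a sample project (this does not start a server)
npx @temporalio/create@latest temporal-dev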
Define Activities
Activities are plain async functions. They live in their own file because Temporal sandboxes workflow code separately.
// src/activities.ts
// db, stripe, shippo, and mailer are assumed to be application clients initialized elsewhere.
import { ApplicationFailure } from '@temporalio/activity';

export async function validateOrder(orderId: string): Promise<{ valid: boolean; amount: number }> {
  const order = await db.orders.findById(orderId);
  if (!order) {
    // Non-retryable: this order doesn't exist
    throw ApplicationFailure.nonRetryable(`Order ${orderId} not found`);
  }
  return { valid: true, amount: order.total };
}

export async function chargeCard(orderId: string, amount: number): Promise<string> {
  // An activity can run more than once (a retry after a timeout or worker crash),
  // so pass Stripe an idempotency key to keep a retry from creating a second charge.
  const result = await stripe.paymentIntents.create(
    { amount, currency: 'usd', metadata: { orderId } },
    { idempotencyKey: `charge-${orderId}` },
  );
  return result.id;
}

export async function createShipment(orderId: string): Promise<string> {
  const label = await shippo.transactions.create({ orderId });
  return label.trackingNumber;
}

export async function updateInventory(orderId: string): Promise<void> {
  await db.orders.updateStatus(orderId, 'shipped');
}

export async function sendConfirmation(orderId: string, trackingNumber: string): Promise<void> {
  await mailer.send({ to: await db.orders.getEmail(orderId), trackingNumber });
}
Define the Workflow
Workflow code must be deterministic. Use Temporal’s proxyActivities to call activities. Temporal intercepts these calls and makes them durable.
// src/workflows.ts
import { proxyActivities } from '@temporalio/workflow';
// Type-only import: workflow code must not pull in the activity implementations.
import type * as activities from './activities';

const { validateOrder, chargeCard, createShipment, updateInventory, sendConfirmation } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: '30 seconds',
    retry: {
      initialInterval: '1 second',
      backoffCoefficient: 2,
      maximumAttempts: 5,
    },
  });

export async function processOrderWorkflow(orderId: string): Promise<void> {
  const { valid, amount } = await validateOrder(orderId);
  if (!valid) {
    return;
  }
  const paymentIntentId = await chargeCard(orderId, amount);
  const trackingNumber = await createShipment(orderId);
  await updateInventory(orderId);
  await sendConfirmation(orderId, trackingNumber);
}
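It helps to work out what a retry policy like the one above means in wall-clock terms. The sketch below is a hypothetical helper, not part of the Temporal SDK: `RetryPolicySketch` and `retryDelaysMs` are invented names, and the cap on the interval is an assumption (taken here as 100 times the initial interval when none is given).

```typescript
// Hypothetical helper for reasoning about a retry policy; not a Temporal API.
interface RetryPolicySketch {
  initialIntervalMs: number;
  backoffCoefficient: number;
  maximumAttempts: number;
  maximumIntervalMs?: number; // assumed cap on any single wait
}

// Returns the wait before each retry, in milliseconds.
function retryDelaysMs(policy: RetryPolicySketch): number[] {
  // Assumption: with no explicit cap, waits are capped at 100x the initial interval.
  const cap = policy.maximumIntervalMs ?? policy.initialIntervalMs * 100;
  const delays: number[] = [];
  // N attempts means at most N - 1 waits between them.
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    const raw = policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, retry);
    delays.push(Math.min(raw, cap));
  }
  return delays;
}

const delays = retryDelaysMs({
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumAttempts: 5,
});
const totalWaitMs = delays.reduce((sum, d) => sum + d, 0);
console.log(delays, totalWaitMs);
```

With a 1 second initial interval, a coefficient of 2, and 5 attempts, the waits are 1, 2, 4, and 8 seconds, 15 seconds in total. If every attempt also ran into the 30 second start-to-close timeout, the activity would take roughly 5 × 30 + 15 = 165 seconds to fail for good, which is worth knowing before choosing these numbers for a payment step.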
Start a Worker
// src/worker.ts
import { Worker } from '@temporalio/worker';
import * as activities from './activities';

async function run() {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'order-processing',
  });
  await worker.run();
}

run().catch(console.error);
Trigger the Workflow
// src/trigger.ts
import { Client } from '@temporalio/client';
import { processOrderWorkflow } from './workflows';

// With no options, the client connects to a local server at localhost:7233.
const client = new Client();

export async function startOrder(orderId: string): Promise<void> {
  await client.workflow.start(processOrderWorkflow, {
    taskQueue: 'order-processing',
    workflowId: `order-${orderId}`, // same ID won't start a second run while one is open
    args: [orderId],
  });
}
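The workflowId is your idempotency key. By default, calling start with an ID that already belongs to a running workflow does not start a second execution: the TypeScript client rejects with WorkflowExecutionAlreadyStartedError, which your API handler can catch and treat as success (using getHandle with the same ID if it needs the existing execution). This means your API handler can safely retry without creating duplicate workflows. Note that the default ID reuse policy permits a new run once the earlier one has closed, so set workflowIdReusePolicy if a finished order must never be processed again.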
Python SDK
Temporal has a first-class Python SDK that mirrors the TypeScript structure.
# activities.py
from temporalio import activity
from temporalio.exceptions import ApplicationError


@activity.defn
async def validate_order(order_id: str) -> dict:
    order = await db.orders.find(order_id)
    if not order:
        raise ApplicationError(f"Order {order_id} not found", non_retryable=True)
    return {"valid": True, "amount": order.total}


@activity.defn
async def charge_card(order_id: str, amount: int) -> str:
    result = await stripe.create_payment_intent(amount=amount)
    return result["id"]
# workflows.py
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

with workflow.unsafe.imports_passed_through():
    from activities import validate_order, charge_card, create_shipment


@workflow.defn
class ProcessOrderWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> None:
        retry_policy = RetryPolicy(
            maximum_attempts=5,
            initial_interval=timedelta(seconds=1),
            backoff_coefficient=2.0,
        )
        result = await workflow.execute_activity(
            validate_order,
            order_id,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=retry_policy,
        )
        if not result["valid"]:
            return
        await workflow.execute_activity(
            charge_card,
            args=[order_id, result["amount"]],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=retry_policy,
        )
        # ... remaining steps
When to Use Temporal vs the Alternatives
The right tool depends on what you’re building.
| Scenario | Recommendation |
|---|---|
| Simple task queue (emails, webhooks) | BullMQ (Node) or Celery (Python) |
| Scheduled jobs, cron-style | Native cron or cloud scheduler |
| Multi-step processes with retries | Temporal |
| Long-running workflows (days/weeks) | Temporal |
| Processes that need to pause and wait | Temporal |
| Complex saga patterns (distributed transactions) | Temporal |
Temporal’s overhead (running a server, learning the SDK, writing deterministic workflow code) is not worth it for a simple “retry this HTTP call three times.” It pays off when you have multi-step processes where partial completion causes real problems, or when you need to orchestrate work across multiple services and know for certain which steps have completed and which have not.
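The saga row in the table deserves a small illustration, since it is the pattern behind "partial completion causes real problems." The sketch below is plain TypeScript with hypothetical names (`SagaStep`, `runSaga`); it is not a Temporal API. It is kept synchronous for brevity, whereas in a real workflow each `run` and `compensate` would be an awaited activity call, and the list of completed steps would survive a crash because it is rebuilt from workflow history.

```typescript
// Hypothetical saga sketch; the names are illustrative, not Temporal APIs.
interface SagaStep {
  name: string;
  run: () => void;        // the forward action, e.g. charge the card
  compensate: () => void; // how to undo it, e.g. refund the charge
}

// Runs steps in order. If one fails, undoes the completed ones in reverse order.
function runSaga(steps: SagaStep[], log: string[]): boolean {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step);
      log.push(`done:${step.name}`);
    } catch {
      log.push(`failed:${step.name}`);
      for (const done of completed.reverse()) {
        done.compensate();
        log.push(`undo:${done.name}`);
      }
      return false;
    }
  }
  return true;
}

const log: string[] = [];
const ok = runSaga(
  [
    { name: 'chargeCard', run: () => {}, compensate: () => {} },
    {
      name: 'createShipment',
      run: () => { throw new Error('carrier API down'); },
      compensate: () => {},
    },
    { name: 'updateInventory', run: () => {}, compensate: () => {} },
  ],
  log,
);
console.log(ok, log);
```

Here the shipment step fails, so the charge is compensated and the inventory step never runs. Writing this by hand on a job queue also means persisting `completed` somewhere; that bookkeeping is the part a durable workflow takes over.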
Signals and Queries
Two features change how you think about workflow orchestration.
Signals let external code send events into a running workflow. A human approval step, an external event, a cancellation request:
// In the workflow
import { condition, defineSignal, proxyActivities, setHandler } from '@temporalio/workflow';
import type * as activities from './activities';

// cancelOrder and processOrder are assumed to be activities exported from ./activities.
const { cancelOrder, processOrder } = proxyActivities<typeof activities>({
  startToCloseTimeout: '30 seconds',
});

const approveSignal = defineSignal<[string]>('approve');

export async function requiresApprovalWorkflow(orderId: string): Promise<void> {
  let approved = false;
  setHandler(approveSignal, (approverEmail: string) => {
    approved = true;
  });
  await condition(() => approved, '7 days'); // wait up to 7 days for approval
  if (!approved) {
    await cancelOrder(orderId);
    return;
  }
  await processOrder(orderId);
}

// From anywhere (API handler, admin panel)
await client.workflow.getHandle(`order-${orderId}`).signal(approveSignal, 'manager@company.com');
Queries let you inspect running workflow state without interrupting it. Useful for status APIs that need to show “step 3 of 5, waiting for payment confirmation.”
Production Deployment
Temporal Cloud is the managed option. They run the Temporal server; you just connect workers. For most teams, this is the right call. Self-hosting the Temporal cluster adds operational burden that’s rarely worth it unless you have strict data sovereignty requirements.
Workers are stateless and horizontally scalable. Deploy as many as you need; they’ll pull work from the task queue. One Temporal cluster can serve multiple application environments if you use separate namespaces.
The Temporal web UI (included with both Cloud and self-hosted) shows every workflow execution, its state, history, and any failures. It’s the debugging tool you wish you had for your current job queue setup.
For teams building products where background processes affect money, inventory, or user accounts, Temporal replaces a category of defensive code (status checks, idempotency tables, manual recovery scripts) with a programming model where correctness is the default.