Why Golem – Golem Cloud

Why Golem?

Because agents are distributed systems.

Most "agent frameworks" stop at prompt chaining. Real agents run for minutes to months, cross many tools and APIs, pause for people, and must never lose state or double-fire.

Golem is an agent-native runtime that gives you durability, exactly-once effects, full observability, and suspend-to-zero — without the need to build queues, schedulers, or idempotency plumbing.

What You Get (Day One)

Production Guarantees Built In

Durable state: After a crash or restart, an agent resumes at the last successful point—no lost context or rework.
Exactly-once external effects: Retries won't double-charge cards, re-open tickets, or post duplicate webhooks.
Durable internal delivery: Agent-to-agent messages are persisted and delivered once.
Suspend & resume: Park idle agents (human approval, scheduled wake, long waits) at zero compute; resume instantly.
Automatic recovery & back-off: Transient failures are retried according to policy; agents are rescheduled on healthy nodes.
Isolation & permissions: Per-agent sandboxes and scoped capabilities prevent cross-talk or privilege bleed.
Observability with rewind: Inspect the full history of prompts, tool calls, and data, then replay safely without duplicate effects.
Zero-downtime upgrades: Run versions side-by-side or migrate live agents forward.
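To make "retried according to policy" concrete, here is a minimal sketch of how a capped exponential back-off schedule could be computed. The `RetryPolicy` shape and the `backoffDelayMs` function are hypothetical names chosen for illustration; they are not Golem's configuration schema, and the runtime applies its own policy for you.

```typescript
// Hypothetical policy shape for illustration only; not Golem's configuration schema.
type RetryPolicy = {
  maxAttempts: number; // total attempts allowed, including the first
  baseDelayMs: number; // delay before the second attempt
  maxDelayMs: number;  // upper bound on any single delay
  multiplier: number;  // growth factor between consecutive delays
};

// Given how many attempts have already failed (1-based), return the delay
// before the next attempt, or null when the policy says to give up.
function backoffDelayMs(policy: RetryPolicy, failedAttempts: number): number | null {
  if (failedAttempts >= policy.maxAttempts) {
    return null; // attempts exhausted
  }
  const delay = policy.baseDelayMs * Math.pow(policy.multiplier, failedAttempts - 1);
  return Math.min(delay, policy.maxDelayMs);
}

const policy: RetryPolicy = { maxAttempts: 5, baseDelayMs: 100, maxDelayMs: 500, multiplier: 2 };
const schedule = [1, 2, 3, 4, 5].map((n) => backoffDelayMs(policy, n));
console.log(schedule);
```

Production policies often add random jitter to each delay so that many agents failing at once do not retry in lockstep; that is omitted here to keep the schedule deterministic.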

The Code You Write; The Runtime You Don't

Use normal TypeScript to write agents and tools. Beneath it, Golem runs a complete runtime—persistence, exactly-once I/O, observability, autoscaling, isolation, and safe upgrades—so your agents survive reality without bespoke infra.

What You Stop Building

Durable queues & outboxes
Idempotency keys & dedupe logic
Cron catch-ups & retry schedulers
Long-poll loops & timeout workarounds
Ad-hoc audit logs & incident recon tooling
DIY "workflow engines" glued to lambdas

Frameworks chain steps; Golem makes those steps durable, replayable, and exactly-once.

Where Golem Shines

Use Cases

Human-in-the-Loop Workflows: Agents pause for review, capture the outcome, and resume with full state, with no repeated actions.
Intelligent Incident Response: Investigate, coordinate remediation, and escalate without losing track or duplicating fixes.
Reliable Data & Transactions Across Systems: Synchronize CRMs/ERPs/APIs/payments with exactly-once guarantees.
Personalized Content & Offers at Scale: Assemble and deliver tailored experiences with no duplicates or misses, even during spikes.
Multi-Agent Research & Analysis: Collaborating agents run for hours to weeks, preserving context and avoiding rework.
Continuous Risk Monitoring & Alerting: Real-time detection and actions without false repeats or missed alerts.

If your workflows are long-running, stateful, span many tools, or keep a human in the loop (HITL), Golem turns fragile prototypes into production-grade agents.

Is Golem a Fit for You?

✅ Strong Fit If You Need:

Agents that must never lose progress or repeat side-effects
HITL pauses, external API waits, or long-tail tasks (minutes → months)
Auditable execution with full traces of prompts, tools, and data
Multi-agent coordination with guaranteed messaging
To ship now without building durable infra yourself

❌ Probably Not a Fit Right Now If:

Your app is stateless or completes in a single request/response
You only need a prompt library or a light orchestration DSL, and reliability is not a concern
You require a programming language other than TypeScript today (more languages are on the way)
You already run on a durable runtime that guarantees exactly-once and you're satisfied with its ops/tooling

Our goal is to help you decide quickly — even if the answer is "not yet."

How Golem Fits Your Stack

Build agents and tools in TypeScript.
Grant capabilities (APIs, tools, other agents).
Deploy agent types to the runtime.
Trigger via secure APIs; send commands or messages.
Observe execution with searchable traces and safe replay.

Interop: Agents call your services over HTTP under exactly-once mediation. Connect tools (including MCP) and other agents with durable delivery. Start with one use case; retire bespoke queues and schedulers over time.

Why a Runtime (Not Just a Library)?

Libraries help you write flows; runtimes keep them correct under failures, retries, and upgrades.

Failure semantics: Exactly-once effects and durable delivery are properties of the runtime, not application glue.
State correctness: Snapshots + durable logs let agents resume exactly where they left off.
Operate at scale: Millions of agents can suspend to zero and wake as events arrive—without cost spikes.
Debug the truth: Full, replayable traces beat partial logs stitched across services.
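The "snapshots + durable logs" idea can be sketched in a few lines: recovery starts from the latest snapshot and replays only the log entries recorded after it. The types and the `recover` function below are hypothetical and kept in memory for illustration; they do not describe Golem's actual storage format.

```typescript
// Hypothetical types for illustration only; not Golem's storage format.
type AgentState = { step: number; notes: string[] };
type LogEntry = { seq: number; note: string };
type Snapshot = { seq: number; state: AgentState };

// Apply one logged event to the state, returning a new state object.
function apply(state: AgentState, entry: LogEntry): AgentState {
  return { step: state.step + 1, notes: [...state.notes, entry.note] };
}

// Rebuild state after a restart: start from the snapshot, then replay,
// in order, only the entries recorded after the snapshot was taken.
function recover(snapshot: Snapshot, log: LogEntry[]): AgentState {
  return log
    .filter((entry) => entry.seq > snapshot.seq)
    .sort((a, b) => a.seq - b.seq)
    .reduce(apply, snapshot.state);
}

const snapshot: Snapshot = { seq: 2, state: { step: 2, notes: ["plan", "search"] } };
const durableLog: LogEntry[] = [
  { seq: 1, note: "plan" },
  { seq: 2, note: "search" },
  { seq: 3, note: "summarize" },
  { seq: 4, note: "await approval" },
];

const recovered = recover(snapshot, durableLog);
console.log(recovered.step, recovered.notes);
```

The snapshot keeps recovery fast because entries 1 and 2 are never re-applied; the log keeps it exact because nothing recorded after the snapshot is lost.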

You can keep using your favorite agent libraries for control flow and run them on Golem for durability and correctness.

What Changes in Your Day-to-Day

Fewer incidents about "what ran twice?" or "where did we lose state?"
Faster iteration—upgrade logic without draining in-flight agents
Straight-line agent code instead of scattered jobs and compensations
Real root-cause analysis at 2 a.m. with full, searchable history

FAQs

Do I have to rewrite everything?

No. Start by running the stateful, failure-sensitive parts of your agents on Golem. Keep your existing APIs and data stores.

What languages are supported?

TypeScript today, with more languages on the way.

How does Golem prevent duplicate effects?

An I/O gateway mediates external calls with durable logs and idempotent commits to ensure exactly-once behavior—even under retries and failover.
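As a rough illustration of the general technique (not of Golem's internals), the sketch below records each external effect under a stable key and returns the recorded result when the same call is retried. `EffectLog` and `runOnce` are hypothetical names, and the in-memory `Map` stands in for durable storage.

```typescript
// Illustrative sketch only: EffectLog and runOnce are hypothetical names,
// not part of Golem's API. The Map stands in for a durable log.
class EffectLog {
  private records = new Map<string, unknown>();

  // Run `effect` at most once per key; later calls with the same key
  // return the recorded result instead of repeating the side effect.
  runOnce<T>(key: string, effect: () => T): T {
    if (this.records.has(key)) {
      return this.records.get(key) as T; // replay: skip the side effect
    }
    const result = effect();
    this.records.set(key, result); // record the outcome for future retries
    return result;
  }
}

// Stand-in for an external call with a visible side effect.
let chargesSent = 0;
const chargeCard = (): string => {
  chargesSent += 1;
  return `receipt-${chargesSent}`;
};

const effectLog = new EffectLog();
const first = effectLog.runOnce("order-42/charge", chargeCard);
const retry = effectLog.runOnce("order-42/charge", chargeCard); // simulated retry
console.log(first, retry, chargesSent);
```

This toy version does not cover a crash between performing the effect and recording it. Closing that gap generally requires the downstream service to deduplicate as well, for example by accepting the same key as an idempotency key.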

Can agents pause for humans or slow APIs?

Yes. Agents suspend & resume with full state intact and zero compute while waiting.

Next Steps

Check the Quickstart guide to learn how to install Golem and build your first agent.
Explore the Develop section to learn in detail how to write Golem agents.
See the Usage section for detailed information about deploying, invoking, and debugging agents, using the command-line interface, and more.

Agents that don't forget. Actions that don't repeat. Ship agents you can trust—without rebuilding distributed systems infrastructure.
