Why Golem?

Golem is a durable computing platform that makes it simple to build and deploy highly reliable distributed systems.

Through durable computing, workers executing on Golem are automatically fault-tolerant, update-tolerant, and flaky-tolerant.

Golem workers continue execution uninterrupted, even when hardware fails, servers go down, clusters are resized, or new updates are deployed.

Golem simplifies the development of highly-reliable, distributed applications.

Benefits

Golem shares many of the benefits of serverless computing, including faster development, lower maintenance overhead, and reduced need for operations.

Due to the fact that Golem is built on durable computing, Golem offers benefits not seen in most serverless solutions, including:

Fault-Tolerance. Golem performs real-time, incremental backups of the state of workers and supervises all workers. In the event of a fault, Golem restores failed workers to new nodes, where they continue execution where they left off.
Update-Tolerance. Golem lets you deploy new versions of your software at any time, and these new versions only affect new workers (by default). Because versions are isolated, you can update freely without breaking critical live code.
Flaky-Tolerance. While Golem is reliable, sometimes your system needs to talk to external systems. Golem performs sophisticated, user-configurable retry policies on recoverable failures, freeing you from the drudgery of retries.

These benefits allow you to easily build highly-reliable distributed systems, without complex architecture or infrastructure.

Example Use Cases

In the following examples, you will learn how Golem simplifies the architecture and development of several different types of distributed applications.

Order Processing

In an online storefront, we may have to take many steps during end-to-end processing of an order:

Check and reserve sufficient inventory.
Validate details of the order, such as shipping address.
Validate payment method details.
Charge the user's credit card or debit their bank account.
Retry or abort on an failed charge (as appropriate), or continue on success.
Record the successful charge in a database.
Dispatch the order to fulfillment.
Send the customer an email with their order details.
Send the user updates as the state of the order is updated by fulfillment.
Handle unexpected errors during fulfillment, or the user's cancellation of the order before dispatch.

The major part of this process can take several seconds to several minutes, while the entire process could take days or even weeks.

In a perfect world, we could write a simple function that performs all of these step in order, in a handful of helper functions that call out to other web services.

However, if we were to push this naive solution into production, we would run into serious problems. Due to faults, updates, and upgrades, there would exist some customers whose order processing was interrupted midway.

When running code is interrupted midway, it often leaves the overall system in an inconsistent state. The user might be surprised to see they were charged for the items that will never be delivered, because the function that was processing the order was interrupted midway.

In order to compensate for these failure scenarios, we might adopt an event-driven architecture. Because events are stored persistently in a durable message queue like Apache Kafka or Apache Pulsar, we would then have a reliable means to detect which steps have been performed, and which have not. We could write processes that recover partially completed orders, and resume at the point where they left off.

While industry-proven and reliable, this approach ends up significantly increasing our development and maintenance costs, and turns what could be a simple and short program into many disparate services, all introducing additional overhead and points of failure.

With Golem, the entire order processing process can be written and deployed as a serverless function. When an API triggers an invocation of the function, the worker will run transactionally, regardless of faults, updates, or flakiness. This automatic reliability lets us focus on business logic, rather than the complexities of designing and maintaining an event-driven architecture.

Session Management

In a multiplayer online betting game, each gaming session could last several minutes to an hour, and depending on the nature of the game, players may be able to place bets days or weeks in advance of the live session.

The responsibilities of session management include:

Financial Transaction Processing. Taking bets and settlement, which involve traditional commerce processes such as the preceding storefront example.
State Management. Managing the state of the game session during play, as game play unfolds.

Due to the financial nature of betting games, it is important that bets be tracked precisely and placed only be allowed when the state of the game and the rules of the game permit the bets to be placed. Moreover, faults can occur at any point and should never halt the progress of a game nor interfere with bets that have been placed so far.

One way of solving this problem would be to use a relational database in order to obtain ACID guarantees on bets. However, the state of the game evolves in real-time, so to make concurrent bets in the presence of a changing game state, it would also be necessary to store the game state in the database. Storing both all bets and the game state itself in the relational database could negatively impact performance, necessitating alternate approaches, such as event-sourcing with CQRS.

In the end, a traditional solution to the problem of session management is likely to be highly tailored to the specific nature of the game and the audience size, and likely to involve some kind of event-sourcing, potentially with CQRS, in order to ensure consistency and recoverability in the event of unanticipated failures.

With Golem, the entire game session can be deployed as a single serverless worker, which responds to updates in game state and takes new bets. Since the data is stored in memory and latency of workers in Golem is very low, performance can be very high, without sacrificing the durability guarantees of other approaches based on ACID databases or event-sourcing.

AI Agent

When implementing an AI agent, each step of the AI agent may translate into multiple different models, some of which may have to be repeated if they do not yield a satisfactory answer.

In addition, many AI agents are stateful, and must accumulate the total information they have produced from the beginning of time until the current moment in time, as this information assists subsequent step formation and execution.

Each execution of a model takes time and resources, which equates to higher latency or higher costs. Due to these realities, it is often a technical requirement for AI agents to execute durably, such that they are fault-tolerant.

Implementing fault-tolerant AI agents can be done using state machines and event-sourcing, together with durable queues and key/value stores.

Golem provides an alternative: each executing AI agent can be a single worker, which can naturally accumulate state in memory, without the need for queues or databases.

Because Golem workers are fault-tolerant, they will continue execution uninterrupted regardless of failures in hardware, restarts, updates, or upgrades.

Because Golem workers are update-tolerant, they can be updated without affecting running workers.

Finally, because Golem workers are flaky-tolerant, the model executions will eventually succeed, even if the cause of the transient failure is a rate limit or other issue.

Summary

All of the examples presented here can be implemented reliably and scalably with existing and industry-proven architectures.

However, when using Golem, each becomes at least an order of magnitude simpler, without compromising reliability or scalability.

Golem gives you the ability to build highly-reliable distributed systems with simple code and simple architecture.

Fundamentals