Consistency Problems in A Microservice Architecture (Part III)

Jul 10, 2019

This post is the final part of a three-part series on the consistency problems arising in a microservice architecture. In this part, we finally talk about distributed transactions.

Distributed Transactions with Event Sourcing

Event Sourcing is a more sophisticated architecture than event notification. At the conceptual level, they are distinctly different in that the event producer does not care about subsequent actions in the event notification architecture, whereas it does in the event sourcing architecture. However, in terms of implementation, the difference is much finer and more blurred. A great introduction to event sourcing can be found here.

Handle Distributed Transactions with Event Sourcing

Consider the following high-level microservices architecture of an e-commerce system. In this example, suppose the sequence of actions is: create a pending order (Order service) ⇒ reserve stock (Inventory service) ⇒ make payment (Payment service) ⇒ conclude order (Order service) and asynchronously send notification (Notification service).

High-level system view of the e-commerce application

For the happy path, the sequence of events is as follows:

Order service creates a new order, marked as “pending” state, and publishes an event named event_order_pending.
Inventory service listens to event_order_pending, reserves the stock and publishes an event_stock_reserved.
Payment service listens to event_stock_reserved, executes the payment logic and publishes an event_order_paid.
Order service listens to event_order_paid, updates the order to “created” state and publishes an event_order_created. Then the Order service returns the result to the client.
Notification service listens to event_order_created and sends a notification to the customer.
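
The happy path above can be sketched with a toy, in-memory event bus. This is only an illustration under stated assumptions: the EventBus class, the handler names, the payload fields (order_id, item, qty) and the synchronous delivery are all hypothetical, and a real system would use a message broker with asynchronous delivery. Only the event names come from the steps above.

```python
from collections import defaultdict


class EventBus:
    """Toy synchronous in-memory bus; a real system would use a message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every published event name, in order

    def subscribe(self, event_name, handler):
        self.handlers[event_name].append(handler)

    def publish(self, event_name, payload):
        self.log.append(event_name)
        for handler in list(self.handlers[event_name]):
            handler(payload)


def wire_happy_path(bus, orders, stock, notifications):
    """Register one handler per service, mirroring the steps above."""

    def inventory_on_order_pending(evt):
        stock[evt["item"]] -= evt["qty"]  # reserve the stock
        bus.publish("event_stock_reserved", evt)

    def payment_on_stock_reserved(evt):
        # Payment logic is elided; this sketch assumes it always succeeds.
        bus.publish("event_order_paid", evt)

    def order_on_order_paid(evt):
        orders[evt["order_id"]] = "created"
        bus.publish("event_order_created", evt)

    def notification_on_order_created(evt):
        notifications.append(evt["order_id"])

    bus.subscribe("event_order_pending", inventory_on_order_pending)
    bus.subscribe("event_stock_reserved", payment_on_stock_reserved)
    bus.subscribe("event_order_paid", order_on_order_paid)
    bus.subscribe("event_order_created", notification_on_order_created)


def create_order(bus, orders, order_id, item, qty):
    """Order service entry point: store a pending order, then publish."""
    orders[order_id] = "pending"
    bus.publish(
        "event_order_pending", {"order_id": order_id, "item": item, "qty": qty}
    )
    return orders[order_id]


bus, orders, stock, notifications = EventBus(), {}, {"book": 5}, []
wire_happy_path(bus, orders, stock, notifications)
final_state = create_order(bus, orders, "o-1", "book", 2)
```

Because this toy bus delivers synchronously, the whole chain has finished by the time create_order returns. With a real broker the Order service would have to wait for event_order_paid before it can answer the client.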

Let’s see what happens if payment fails:

Payment service publishes an event_order_payment_failed.
Both Order service and Inventory service pick up event_order_payment_failed and

a. Order service marks the order as “failed”.

b. Inventory service releases the reserved stock.
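
A minimal sketch of this failure path, assuming a toy in-memory publish/subscribe helper and hypothetical data structures (orders, available, reserved). The point it tries to show is that two services subscribe to the same event independently, and that the stock release is written so that a redelivered event does no harm.

```python
from collections import defaultdict

handlers = defaultdict(list)  # event name -> list of subscriber callbacks


def subscribe(event_name, handler):
    handlers[event_name].append(handler)


def publish(event_name, payload):
    for handler in list(handlers[event_name]):
        handler(payload)


orders = {"o-1": "pending"}
available = {"book": 3}          # stock still on the shelf
reserved = {"o-1": ("book", 2)}  # reservations keyed by order id


def order_on_payment_failed(evt):
    orders[evt["order_id"]] = "failed"


def inventory_on_payment_failed(evt):
    # pop() with a default makes a duplicate delivery a harmless no-op.
    reservation = reserved.pop(evt["order_id"], None)
    if reservation is not None:
        item, qty = reservation
        available[item] += qty


subscribe("event_order_payment_failed", order_on_payment_failed)
subscribe("event_order_payment_failed", inventory_on_payment_failed)

# The Payment service reports the failure; deliver twice to mimic a redelivery.
publish("event_order_payment_failed", {"order_id": "o-1"})
publish("event_order_payment_failed", {"order_id": "o-1"})
```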

Pros and Cons

One striking feature of the event sourcing pattern is that there is lots of tangling between the services through events. Remember we mentioned in the beginning of this essay that events are also part of a service’s interface? That means these services are closely coupled in terms of business logic, and there could be cyclic dependencies if not designed carefully. This coupling adds one more item to the already long list of complexities of event sourcing (quoted from Martin Fowler’s presentation):

Event schema
Unfamiliar
Asynchrony
Versioning

All these complexities would make the design more difficult, and testing and troubleshooting more challenging.

Nonetheless, if we do want to handle transactions with event sourcing, we need to apply the local transaction technique discussed in the Section Workflow Consistency in EDA to ensure atomicity of local update and messaging.
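
One common way to get this atomicity is to write the business row and the outgoing event into the same local database transaction, and let a separate relay publish the stored events afterwards. The sketch below assumes that reading of the local transaction technique. The table names (orders, outbox), the function names and the use of SQLite are illustrative choices of mine, not something prescribed by this series.

```python
import json
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE outbox ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_name TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " sent INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def create_pending_order(db, order_id):
    """Write the order row and the event row in ONE local transaction."""
    with db:  # commits on success, rolls back if an exception escapes
        db.execute(
            "INSERT INTO orders (id, state) VALUES (?, 'pending')", (order_id,)
        )
        db.execute(
            "INSERT INTO outbox (event_name, payload) VALUES (?, ?)",
            ("event_order_pending", json.dumps({"order_id": order_id})),
        )


def relay_outbox(db, publish):
    """Push unsent events to the broker, then mark them as sent.

    A crash between publish() and the UPDATE re-sends the event on the next
    run, so delivery is at-least-once and consumers must be idempotent.
    """
    rows = db.execute(
        "SELECT seq, event_name, payload FROM outbox WHERE sent = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_name, payload in rows:
        publish(event_name, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
    return len(rows)


db = open_db()
create_pending_order(db, "o-1")

# A duplicate order id violates the primary key, so neither row is written.
try:
    create_pending_order(db, "o-1")
except sqlite3.IntegrityError:
    pass

published = []
relayed = relay_outbox(db, lambda name, payload: published.append((name, payload)))
```

The lambda stands in for a real broker client. In this sketch the relay is called by hand; in practice it would run periodically or be driven by the database's change log.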

When is ES suitable?

When you need an audit log, e.g. accounting application
When you need high performance and scalability

Distributed Transactions with A Central Coordinator

We will use the same e-commerce example described in the previous section. The basic idea is to have the Order service function as a coordinator for the order creation task, hence the high-level architecture diagram looks like this:

High-level architecture of the e-commerce application

The flow for a successfully created order is as follows:

The flow for a failed order due to payment failure is as follows:

In this scenario, we need to roll back the reserved stock when the payment fails. Generally there is only one happy path and many different failure scenarios, and the Order service, as the coordinator, must handle all scenarios correctly. One practical way is to implement a state machine in the Order service, and the Order service must persist the state machine after each operation so that a failed transaction can be recovered forward or rolled back. One possible design of such a state transition is illustrated below:
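
As a rough sketch of what such a persisted state machine could look like in code: the state names, the OrderCoordinator class and the fake clients below are my own assumptions for illustration, and a plain dict stands in for the durable store. The coordinator saves the state after every transition, so a restarted process can pick up a half-finished transaction and either move it forward or finish rolling it back.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    STOCK_RESERVED = "stock_reserved"
    PAID = "paid"
    CREATED = "created"            # terminal: success
    ROLLING_BACK = "rolling_back"
    FAILED = "failed"              # terminal: failure


TERMINAL = {State.CREATED, State.FAILED}


class OrderCoordinator:
    def __init__(self, store, inventory, payment):
        self.store = store          # dict standing in for a durable table
        self.inventory = inventory  # hypothetical client: reserve/release return bool
        self.payment = payment      # hypothetical client: charge returns bool

    def _save(self, txn_id, state):
        self.store[txn_id] = state.value

    def step(self, txn_id):
        """Perform exactly one transition and persist the new state."""
        state = State(self.store[txn_id])
        if state is State.PENDING:
            ok = self.inventory.reserve(txn_id)
            self._save(txn_id, State.STOCK_RESERVED if ok else State.FAILED)
        elif state is State.STOCK_RESERVED:
            ok = self.payment.charge(txn_id)
            self._save(txn_id, State.PAID if ok else State.ROLLING_BACK)
        elif state is State.PAID:
            self._save(txn_id, State.CREATED)
        elif state is State.ROLLING_BACK:
            # Stay in ROLLING_BACK until the compensation is confirmed.
            if self.inventory.release(txn_id):
                self._save(txn_id, State.FAILED)
        return State(self.store[txn_id])

    def run(self, txn_id, max_steps=10):
        """Start a new transaction or resume a persisted one."""
        self.store.setdefault(txn_id, State.PENDING.value)
        state = State(self.store[txn_id])
        for _ in range(max_steps):
            if state in TERMINAL:
                break
            state = self.step(txn_id)
        return state


class FakeInventory:
    def __init__(self):
        self.calls = []

    def reserve(self, txn_id):
        self.calls.append(("reserve", txn_id))
        return True

    def release(self, txn_id):
        self.calls.append(("release", txn_id))
        return True


class FakePayment:
    def __init__(self, succeeds):
        self.succeeds = succeeds

    def charge(self, txn_id):
        return self.succeeds


happy = OrderCoordinator({}, FakeInventory(), FakePayment(True)).run("t-1")

failing_inventory = FakeInventory()
failed = OrderCoordinator({}, failing_inventory, FakePayment(False)).run("t-2")

# Simulate a restart: the store already says a rollback was in progress.
resumed_inventory = FakeInventory()
resumed = OrderCoordinator(
    {"t-3": "rolling_back"}, resumed_inventory, FakePayment(True)
).run("t-3")
```

Note that the remote call and the state update in step() are not atomic. If the coordinator crashes between them, the same call is issued again after recovery, which is why the participants have to tolerate retries.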

A few elements must be in place to enable the implementation of this state machine:

There must be a unique transaction id that is used across the Order service, Payment service and Inventory service.
The Payment and Inventory services must be idempotent, so that the Order service can retry the same operation without worrying about unwanted side effects.
For each operation, the Payment and Inventory services should provide a corresponding compensation operation to roll back the previous operation. Again, idempotency must be supported in the compensation operation.
A cron job is needed inside the Order service (i.e. the coordinator) to repair failed transactions. The cron job must avoid race conditions with the normal execution of the state machine.
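
To make the idempotency and compensation requirements concrete, here is a minimal sketch of an Inventory participant. The class, its method names and the in-memory bookkeeping are hypothetical. The idea it illustrates is that every call carries the shared transaction id, and the service remembers what it has already done for that id.

```python
class InventoryService:
    """Toy participant: every operation is keyed by the shared transaction id."""

    def __init__(self, stock):
        self.stock = dict(stock)  # item -> units available
        self.reservations = {}    # txn_id -> (item, qty, status)

    def reserve(self, txn_id, item, qty):
        """Forward operation. A retry with the same txn_id changes nothing."""
        if txn_id in self.reservations:
            # Already handled: report whether the reservation is still active.
            return self.reservations[txn_id][2] == "reserved"
        if self.stock.get(item, 0) < qty:
            return False
        self.stock[item] -= qty
        self.reservations[txn_id] = (item, qty, "reserved")
        return True

    def release(self, txn_id):
        """Compensation. Calling it again after the stock is back is a no-op."""
        record = self.reservations.get(txn_id)
        if record is None or record[2] == "released":
            return True
        item, qty, _ = record
        self.stock[item] += qty
        self.reservations[txn_id] = (item, qty, "released")
        return True


inv = InventoryService({"book": 5})
first = inv.reserve("t-1", "book", 2)
retry = inv.reserve("t-1", "book", 2)   # coordinator retried after a timeout
after_reserve = inv.stock["book"]

inv.release("t-1")
inv.release("t-1")                      # compensation retried as well
after_release = inv.stock["book"]
late_retry = inv.reserve("t-1", "book", 2)  # forward op retried after rollback
```

In this sketch a forward operation that arrives after the compensation is rejected rather than executed again, because the record for that transaction id is already marked as released.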

As you can see, even though the idea is simple, it actually requires lots of effort to implement it right, with strict semantics requirements imposed on the Inventory and Payment services.

Compare with the Saga Pattern

Earlier I mentioned that what I describe as a workflow or a transaction is actually the same as the Saga pattern. Even though I don’t intend to describe the Saga pattern in detail (there is plenty of material online, and a few references are given at the end of the essay), a quick comparison may help to put things into perspective.

There are two popular implementations of the Saga pattern:

Event/Choreography: There is no central coordinator; each service listens for events, acts upon them and publishes its own events.
Command/Orchestration: There is a central coordinator to manage the sequence of actions at one place.

Clearly, the event sourcing approach described above is the same as the Saga choreography pattern. However, there is one important difference between my coordinator approach and the Saga orchestration pattern: I have used synchronous request/response RPCs instead of messaging. This communication mechanism difference actually has significant consequences, as summarised below:

Central coordinator with different communication mechanisms

Therefore, it may be a good idea to start with RPC but design a clear separation of the interface adaptation layer and the logic layer, so that RPC can be replaced with messaging later with minimum effort.
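
A small sketch of that separation, assuming hypothetical names throughout (InventoryPort, the two adapters, the "Inventory.Reserve" method and the "cmd_reserve_stock" topic). The logic layer depends only on a narrow interface, and the transport lives in an adapter that can be swapped.

```python
from typing import Callable, Protocol


class InventoryPort(Protocol):
    """What the coordinator's logic layer needs; transport-agnostic."""

    def reserve(self, txn_id: str, item: str, qty: int) -> bool: ...


class RpcInventoryAdapter:
    """Adapter over a synchronous call. `call` is a hypothetical RPC stub."""

    def __init__(self, call: Callable[[str, dict], dict]):
        self._call = call

    def reserve(self, txn_id: str, item: str, qty: int) -> bool:
        reply = self._call(
            "Inventory.Reserve", {"txn_id": txn_id, "item": item, "qty": qty}
        )
        return bool(reply.get("ok"))


class MessagingInventoryAdapter:
    """Adapter over a broker. `send` and `wait_for_reply` are hypothetical hooks."""

    def __init__(
        self,
        send: Callable[[str, dict], None],
        wait_for_reply: Callable[[str], dict],
    ):
        self._send = send
        self._wait_for_reply = wait_for_reply

    def reserve(self, txn_id: str, item: str, qty: int) -> bool:
        self._send(
            "cmd_reserve_stock", {"txn_id": txn_id, "item": item, "qty": qty}
        )
        return bool(self._wait_for_reply(txn_id).get("ok"))


def place_order(inventory: InventoryPort, txn_id: str, item: str, qty: int) -> str:
    """Logic layer: depends only on the port, never on the transport."""
    return "stock_reserved" if inventory.reserve(txn_id, item, qty) else "failed"


# Fakes standing in for a real RPC stub and a real broker client.
rpc = RpcInventoryAdapter(lambda method, body: {"ok": body["qty"] <= 5})

sent = []
replies = {"t-2": {"ok": True}}
msg = MessagingInventoryAdapter(
    lambda topic, body: sent.append((topic, body)),
    lambda txn_id: replies[txn_id],
)

via_rpc = place_order(rpc, "t-1", "book", 2)
via_msg = place_order(msg, "t-2", "book", 2)
```

The messaging adapter here still blocks until a reply arrives, which keeps the logic layer unchanged but gives up some of the benefits of messaging. A fully asynchronous coordinator would need the state machine to be driven by incoming replies instead.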

Comparing the Solutions

The Differences

Event Sourcing

Asynchronous
Performant
Less familiar programming style
Business logic is distributed, and more coupling between services
Possibly higher latency due to the message broker
Difficult to add a new step in a transaction
Difficult to reason about
Risk of cyclic dependency
More difficult to test and troubleshoot

Central Coordinator with RPC

Synchronous
More familiar programming style
Centralised logic, and loose coupling between services
Participant is simple
Lower latency
Easier to add a new step
No cyclic dependency
Easier to reason about
Easier to test and troubleshoot
Less performant
Coordinator must implement the state machine carefully

The Commonalities

Both approaches implement a state machine to fulfill a distributed transaction. The event sourcing approach implements the state machine implicitly, which is done collectively by all involved services. In contrast, the coordinator approach implements the state machine explicitly.
Individual services must provide a compensation operation for each (forward) operation
Individual services must provide idempotent semantics
Both approaches support only eventual consistency: it is very difficult to provide isolation between transactions, so the system must tolerate in-flight inconsistency

Wrap Up

Handling distributed transactions is not easy. I hope this essay puts you in a position to understand the various issues you need to consider when designing a microservice architecture.

If I had to choose between the event sourcing approach and the coordinator approach for handling distributed transactions in microservices, I would prefer the coordinator. I would only consider event sourcing if it is supported out of the box by some framework. Even so, I may still prefer the coordinator approach due to its simplicity.

Although the solutions presented seem straightforward at a high level, it takes careful design to actually implement everything right. I have skimmed over the details of idempotency. I have also completely ignored the concurrency issues that may arise from distributed transactions.

Concurrency will become an issue when we scale out the microservices horizontally, i.e. deploy multiple instances. Imagine that the coordinator sends/publishes the command/event to instance A and later sends/publishes the rollback command/event to instance B. Instance B may well attempt to execute the compensation before instance A even starts to execute the (forward) operation. The result might be that the compensation does nothing while the (forward) operation still gets executed, which is not what we want. This could happen more often with asynchronous messaging, but it can also happen with synchronous RPC in the event of network partitioning (timeout).
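
The race is easy to reproduce in a few lines. In the sketch below, a single object stands in for the state shared by instances A and B, and all class names are hypothetical. NaiveInventory shows the problem as described. GuardedInventory shows one possible mitigation that is not discussed in this series: remember which transactions have been compensated and reject a forward operation that arrives late.

```python
class NaiveInventory:
    def __init__(self, stock):
        self.stock = stock
        self.reserved = {}  # txn_id -> qty

    def reserve(self, txn_id, qty):
        self.stock -= qty
        self.reserved[txn_id] = qty
        return True

    def release(self, txn_id):
        # Nothing reserved yet for this txn_id: the compensation does nothing.
        qty = self.reserved.pop(txn_id, 0)
        self.stock += qty


class GuardedInventory(NaiveInventory):
    """Remembers compensated txn ids so a late forward operation is rejected."""

    def __init__(self, stock):
        super().__init__(stock)
        self.cancelled = set()

    def reserve(self, txn_id, qty):
        if txn_id in self.cancelled:
            return False
        return super().reserve(txn_id, qty)

    def release(self, txn_id):
        self.cancelled.add(txn_id)
        super().release(txn_id)


# Out-of-order delivery: the compensation is processed before the forward op.
naive = NaiveInventory(5)
naive.release("t-1")
naive.reserve("t-1", 2)

guarded = GuardedInventory(5)
guarded.release("t-1")
accepted = guarded.reserve("t-1", 2)
```

With the naive version, two units stay reserved for a transaction the coordinator already considers rolled back. The guarded version leaves the stock untouched, at the cost of having to keep the cancelled ids around for some time.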

I hope I will find time to write about idempotency and concurrency issues, so please check back later :-)

Part II: https://medium.com/@zongwb/distributed-transactions-in-a-microservice-architecture-b4d6494de59e