Consistency Problems in A Microservice Architecture (Part III)
This post is final part of a three-part series on the consistency problems arising in a microservice architecture. Finally, we are talking about distributed transactions.
Distributed Transactions with Event Sourcing
Event Sourcing is a more sophisticated architecture than event notification. At the conceptual level, they are distinctly different in that the event producer does not care about subsequent actions in the event notification architecture, whereas it does in the event sourcing architecture. However, in terms of implementation, the difference is much finer and more blurred. A great introduction to event sourcing can be found here.
Handle Distributed Transactions with Event Sourcing
Consider the following high-level microservices architecture of an e-commerce system. In this example, suppose the sequence of actions are: create a pending order (Order service) ⇒ reserve stock (Inventory service) ⇒ make payment (Payment service) ⇒ conclude order (Order service) and asynchronously send notification (Notification service).
For the happy path, the sequence of events are as follows:
- Order service creates a new order, marked as “pending” state, and publishes an event named event_order_pending.
- Inventory service listens to event_order_pending, reserves the stock and publishes an event_stock_reserved.
- Payment service listens to event_stock_reserved, executes the payment logic and publishes an event_order_paid.
- Order service listens to event_order_paid, updates the order to “created” state and publishes an event_order_created. Then the Order service returns the result to the client.
- Notification service listens to event_order_created, sends a notification to the customer.
Let’s see what happens if payment fails:
- Payment service publishes an event_order_payment_failed.
- Both Order service and Inventory service pick up event_order_payment_failed and
a. Order service marks the order as “failed”.
b. Inventory service releases the reserved stock.
Pros and Cons
One striking feature of the event sourcing pattern is that there is lots of tangling between the services through events. Remember we mentioned in the beginning of this essay that events are also part of a service’s interface? That means these services are closely coupled in terms of business logic, and there could be cyclic dependencies if not designed carefully. This coupling adds one more item to the already long list of complexities of event sourcing (quoted from Martin Fowler’s presentation):
- Event schema
- Unfamiliar
- Asynchrony
- Versioning
All these complexities would make the design more difficult, and testing and troubleshooting more challenging.
Nonetheless, if we do want to handle transactions with event sourcing, we need to apply the local transaction technique discussed in the Section Workflow Consistency in EDA to ensure atomicity of local update and messaging.
When is ES suitable?
- When you need an audit log, e.g. accounting application
- When you high performance and scalability
Distributed Transactions with A Central Coordinator
We will use the same e-commerce example described in the previous section. The basic idea is to have the Order service function as a coordinator for the order creation task, hence the high-level architecture diagram looks like this:
The flow for a successfully created order is as follows:
The flow for a failed order due to payment failure is as follows:
In this scenario, we need to roll back the reserved stock when the payment fails. Generally there is only one happy path and many different failure scenarios, and the Order service, as the coordinator, must handle all scenarios correctly. One practical way is to implement a state machine in the Order service, and the Order service must persist the state machine after each operation so that a failed transaction can be recovered forward or rolled back. One possible design of such a state transition is illustrated below:
A few elements must be in place to enable the implementation of this state machine:
- There must be a unique transaction id that is used across the Order service, Payment service and Inventory service.
- The Payment and Inventory services must be idempotent, so that the Order service can retry the same operation without worrying about unwanted side effects.
- For each operation, the Payment and Inventory services should provide a corresponding compensation operation to roll back the previous operation. Again, idempotency must be supported in the compensation operation.
- A cron job is needed inside the Order service (i.e. the coordinator) to repair failed transactions. The cron job must avoid race conditions with the normal execution of the state machine.
As you can see, even though the idea is simple, it actually requires lots of effort to implement it right, with strict semantics requirements imposed on the Inventory and Payment services.
Compare with the Saga Pattern
Earlier I mentioned that the way I describe as a workflow or a transaction is actually the same as the Saga pattern. Even though I didn’t intend to describe the Saga patterns (plenty of online materials and a few references are given at the end of the essay), a quick comparison may help to put things into perspective.
There are two popular implementations of the Saga pattern:
- Event/Choreography: There is no central coordinator, and each service listens and publishes events and act upon events.
- Command/Orchestration: There is a central coordinator to manage the sequence of actions at one place.
Clearly, the event sourcing approach described above is the same as the Saga choreography pattern. However, there is one important difference between my coordinator approach and the Saga orchestration pattern: I have used synchronous request/response RPCs instead of messaging. This communication mechanism difference actually has significant consequences, as summarised below:
Therefore, it may be a good idea to start with RPC but design a clear separation of the interface adaptation layer and the logic layer, so that RPC can be replaced with messaging later with minimum effort.
Comparing the Solutions
The Differences
Event Sourcing
- Asynchronous
- Performant
- Less familiar programming style
- Business logic is distributed, and more coupling between services
- Possibly higher latency due to the message broker
- Difficult to add new step in a transaction
- Difficult to reason about
- Risk of cyclic dependency
- More difficult to test and troubleshoot
Central Coordinator with RPC
- Synchronous
- More familiar programming style
- Centralised logic, and loose coupling between services
- Participant is simple
- Lower latency
- Easier to add a new step
- No cyclic dependency
- Easier to reason about
- Easier to test and troubleshoot
- Less performant
- Coordinator must implement the state machine carefully
The Commonalities
- Both approaches implement a state machine to fulfill a distributed transaction. The event sourcing approach implements the state machine implicitly, which is done collectively by all involved services. In contrast, the coordinator approach implements the state machine explicitly.
- Individual services must provide a compensation operation for each (forward) operation
- Individual services must provide idempotent semantics
- Only support eventual consistency, as it’s very difficult to support isolation between transactions hence must tolerate in-flight inconsistency
Wrap Up
Handling distributed transactions is not easy. I hope this essay sets you up in a position to understand the various issues you need to consider when designing amicroservice architecture.
If I am to choose between the event sourcing vs. coordination approach for handling distributed transactions in microservices, I would prefer the coordination. I would only consider the event sourcing if it is supported out-of-box by some framework. Even so, I may still prefer the coordination approach due to its simplicity.
Although at a high-level, the solutions presented seem straightforward, it calls for careful design to actually implement everything right. I have been skimming on the details of idempotency. I have also completely ignored the concurrency issues that may arise from distributed transactions.
Concurrency will become an issue when we scale out the microservices horizontally, i.e. deploying multiple instances. Imagine that the coordinator sends/publishes the command/event to instance A and later sends/publishes the rollback command/event to instance B. It is likely that instance B may attempt to execute the compensation before instance A even starts to execute the (forward) operation. The result might be the compensation does nothing while the (forward) operation gets executed, which is not what we want. This could happen more often with asynchronous messaging, but it can also happen with synchronous RPC in the event of network partitioning (timeout).
I hope I will find time to write about idempotency and concurrency issues, so please check back later :-)
Part II: https://medium.com/@zongwb/distributed-transactions-in-a-microservice-architecture-b4d6494de59e
References
BASE model. https://queue.acm.org/detail.cfm?id=1394128
Martin Fowler. The Many Meanings of Event-Driven Architecture https://www.youtube.com/watch?v=STKCRSUsyP0
Chris Richardson. Using Saga Patterns to Maintain Data Consistency in a Microservice Architecture. https://www.youtube.com/watch?v=YPbGW3Fnmbc
Randy Shoup. Managing Data in Microservices. https://www.youtube.com/watch?v=E8-e-3fRHBw
https://blog.couchbase.com/saga-pattern-implement-business-transactions-using-microservices-part/
https://blog.couchbase.com/saga-pattern-implement-business-transactions-using-microservices-part-2/
https://www.nginx.com/blog/event-driven-data-management-microservices/
https://en.wikipedia.org/wiki/Two-phase_commit_protocol
https://en.wikipedia.org/wiki/Consistency_model
https://martinfowler.com/articles/microservices.html
https://martinfowler.com/eaaDev/EventSourcing.html
https://github.com/cer/event-sourcing-examples/wiki
https://github.com/cer/event-sourcing-examples/wiki/DeveloperGuide