Consistency Problems in A Microservice Architecture (Part I)
This three-part series discusses the consistency problems arising in a microservice architecture. It is not meant to be comprehensive treatment on the topic but instead focuses on the common application scenarios in practice and practical solutions that can be implemented from scratch. The target audience is software engineers working with Internet companies (in contrast to banking or financial services), especially tech startups.
This post is part 1 of the series, and we will define the problems and application scenarios.
Introduction
To start off, I think it is important to qualify the scope of this essay. We will introduce the basics of consistency issues and discuss the solutions to these issues in a microservice architecture without relying on existing frameworks. We will only discuss solutions that achieve eventual consistency. I feel this is a reasonable assumption and practical trade-off as most applications actually can tolerate the in-flight inconsistency before the whole transaction completes.
As many of the technical terms in this essay are widely used in the software community but are often vague enough to mean different things by different people, let’s first lay down our definitions for a few common terms:
- Message: A physical data record passed around between services.
- Event: An event is a message that is expected to be processed by some downstream event processors. The key difference between an event and a message is that an event carries some semantics and should be considered as a first-class construct in system design.
We should note that distributed transactions (probably the most difficult consistency problems) are not new in the enterprise IT world, and there exist frameworks to handle them, typically in Java/Scala. However, it’s less established in microservice and I have yet to find a battle-tested framework which is not based on Java.
Quick Recap on Microservices, Event-driven Architecture, and Messaging
Microservice-oriented architecture has become popular in recent years and there are tons of materials on this topic on the Internet. A good introduction is on the nginx website (https://www.nginx.com/blog/introduction-to-microservices/). However, there is no formal definition for microservice. In my view, a microservice serves a well-defined business boundary, has a well-defined interface and owns its private database. All updates within a microservice can be guaranteed transactional. Access to a service’s data must go through its published interface.
We also hear the term event-driven architecture (EDA) very often, even though it can mean many different patterns (refer to this excellent talk by Martin Fowler ). Regardless of the exact pattern, there will always be some kind of message queue to store the events, and the services in EDA constantly consume and produce events to drive the state change of the application.
I think in practice, any microservice architecture almost always has some sort of event-driven elements, and vice versa. (Shall we use a more precise term like event-driven microservice-oriented architecture?)
Almost with no exception, we will need some kind of messaging component in an event-driven and/or microservice architecture. It is important to note that when designing our system with a messaging component, we must choose one from the two (practical) delivery guarantees: at-least-once delivery and at-most-once delivery. In my opinion, exactly-once delivery is not possible without significant delay and intrusion to the application. With this in mind, we must always assume at-least-once delivery guarantee for transactions (as at-most-once guarantee is clearly not acceptable), and that means we must be prepared to handle duplicate messages/events in the downstream processor. We will talk about it later when we talk about idempotency.
As a side note, even though not closely related to our topic here, when we design a service, bear in mind that its interface includes any mechanism for getting data into or out of the service:
- Synchronous request/response
- Events the service consumes
- Events the service produces
- Bulk data reads and writes (e.g. ETL)
Consistency Problems
Just a quick recap. The four properties of a transaction are ACID, namely:
- Atomicity
- Consistency
- Isolation
- Durability
We’ll ignore isolation and durability for simplicity. In practice, durability should already be supported by the underlying databases, and isolation should be relaxed as we (could only) aim for eventual consistency (For the different consistency models, see https://en.wikipedia.org/wiki/Consistency_model).
Classification of Consistency Problems
Broadly speaking, there are two types of consistency problems in a distributed system: replication consistency and transaction consistency. Transaction consistency can be further classified into more specific use cases, as shown in the following table.
Table: Classification of Consistency Problems
Replication consistency normally arises when the same data needs to be replicated on multiple instances for high availability (and also higher read throughput), and the popular protocols to handle replication consistency are Paxos and Raft. Generally speaking, we don’t need to worry about replication consistency when developing applications as it should be handled by the middleware or database. Neither do we need to worry about the transactions within a single database as long as we choose one that supports ACID.
Distributed transactions across multiple databases are not our concern, as I believe a well-thought microservice architecture should avoid such scenarios as much as possible. That means, either all updates are done in a single database within one microservice, or the databases should be hid behind respective microservices and we should then handle the transactions at the microservice level. (When you have to update two databases within a microservice in one transaction, I think you should consider to combine them into one database to make you life easier.)
Types of Application Scenarios
Now let’s turn our attention to the application scenarios where we will face the consistency problems. It makes things clearer if we classify the application scenarios into two types:
- Distributed workflow
All updates should be eventually successful, and some update can be committed before the others and will never need to be rolled back if the other updates fail later, as far as the workflow is concerned. For example, we may want to send a welcome email upon a new account creation. The account can be created successfully even if the email service cannot send the welcome email right away. Note that out of band rollback is not considered part of the workflow.
- Distributed transaction
All updates must be done atomically; that is, in a multi-step transaction we need to roll back all preceding steps if any step fails. We only aim for eventual consistency and must tolerate the in-flight inconsistency when the steps are being executed. An example would be an order service in an e-commerce application, where we need to reserve the stock before finally creating an order.
The key difference between these two types of scenarios is that there is no atomicity per se in a workflow, as there is no need for rollback and we will retry the failed steps on a best-effort basis. In contrast, atomicity is mandatory for a transaction. In addition, the window of in-flight inconsistency in a workflow can be arbitrarily long before eventual consistency is achieved.
Anatomy of A Workflow or Transaction
Now let’s look at what actually happens when processing a workflow/transaction. A distributed transaction or workflow consists of multiple steps and each step is a local transaction handled by one microservice.
Let’s denote the local transactions as T1, T2 and so on. To be able to roll back a transaction, each local transaction needs to have a compensating transaction, denoted C1, C2 and so on. Hence, in a workflow, the sequence is always T1, T2 till Tn. On the other hand, for a transaction, when T3 fails the sequence will be T1, T2, T3, C2, C1, so that the T2 is rolled back by C2 and T1 is rolled back by C1. As a metaphor, we can visualize a workflow as one-way traffic and a transaction as two-way traffic.
Forward Recovery vs. Backward Compensation
Rollback is a backward compensation process, which maintains atomicity of a transaction. What makes things more complicated is that rollback can also fail. Hence there should be a mechanism to retry the compensation.
Note that we do not always have to roll back a transaction; we may consider forward recovery in some cases. Forward recovery means we retry the failed local transaction and then execute the remaining local transactions. Forward recovery can improve the system’s overall reliability by hiding the individual service’s temporary failures.
It should now be apparent that this process is basically a state machine that drives the transaction to completion (either forward or backward).
The Saga Pattern
For the sake of relating our discussion to the literature, the process described above is actually the Saga pattern that was proposed more than 30 years ago. Even though the purpose of this essay was not to describe the Saga pattern, you can see that we somehow arrived at the pattern quite naturally just by thinking about a business process.
(Definition: A Saga is a sequence of local transactions where each transaction updates data within a single service. The first transaction is initiated by an external request corresponding to the system operation, and then each subsequent step is triggered by the completion of the previous one and it contains the mechanism of handling rollback for the whole transaction sequence.)
The next post will dive into the workflow consistency problem.