How did we switch from a synchronous buying process to an asynchronous one?
(Once again, thanks to Jason Tan for his peer review!)
Critical use cases in applications are typically the most complex ones. One of these use cases at Sherweb is the process of buying something. Here is an (oversimplified) overview of that process:
1. Resolve the price of the item.
2. Check if there are any promotions available for the item.
3. Create necessary artifacts in the billing engine to properly bill the client.
4. Deliver the item’s associated services in external infrastructures (e.g., adding O365 mailbox seats in the Microsoft ecosystem).
5. Send a confirmation email to the client (optional; depends on a lot of things).
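The steps above can be sketched as one inline function. This is a hypothetical illustration of the original synchronous flow, not Sherweb’s actual code; every function name and value is invented for the example:

```python
# A minimal sketch of the synchronous buying flow: every step runs inline,
# and the user waits until the last one completes. All names and data here
# are illustrative.

def resolve_price(item_id):
    # Step 1: look up the unit price (hard-coded catalog for the sketch).
    return {"o365-mailbox": 10.0}[item_id]

def apply_promotions(item_id, price):
    # Step 2: apply any promotion available for the item (10% off here).
    promotions = {"o365-mailbox": 0.10}
    return price * (1 - promotions.get(item_id, 0.0))

def create_billing_artifacts(client_id, item_id, qty, unit_price):
    # Step 3: create what the billing engine needs; returns an invoice id.
    return f"inv-{client_id}-{item_id}-{qty}"

def deliver_services(client_id, item_id, qty):
    # Step 4: call the external provider (the slow, failure-prone part).
    pass

def send_confirmation_email(client_id, invoice_id):
    # Step 5 (optional): notify the client.
    pass

def buy_item(client_id, item_id, qty):
    # The whole process runs in one request: the user waits for every step.
    unit_price = apply_promotions(item_id, resolve_price(item_id))
    invoice_id = create_billing_artifacts(client_id, item_id, qty, unit_price)
    deliver_services(client_id, item_id, qty)
    send_confirmation_email(client_id, invoice_id)
    return invoice_id
```

The pain point is visible in `buy_item`: any slow or failing step (especially step 4) holds the user hostage until it finishes.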
Many years ago, when we started developing our Web portal, we made a good move by handling the service delivery in an async manner. Because the integrations with our external providers (Microsoft, Google, etc.) are complex and relatively time-consuming, it was out of the question to block the user’s actions in the Web application until the service delivery was complete. At that moment, we introduced a visual icon informing the user that something was ongoing in the system, with a tracking mechanism to follow each step’s progression. That was pretty cool back then.
All that worked very well for many years. We had a typical monolith code base at the time, so it was natural to add this asynchronous pattern when communicating outside our infrastructure. Over time, many business rules and use cases were added, meaning more complexity was added too. We hit the typical performance issues for a system of that size and maturity (database timeouts, deadlocks, etc.).
We started the typical journey of splitting a monolith code base into micro-services. But without properly redesigning aggregate roots, there is no silver bullet, and it is easy to end up with a distributed monolith (which is even worse). Even without falling into that trap, we still ended up in a situation where our buying use case was becoming harder and harder to manage; micro-service extractions mean multiple internal API calls (plus ongoing performance issues that the extractions did not fix). Even by refactoring our code base and aggregate roots over and over again, there is a hard limit to how much work you can accomplish after a single click in a Web application while the user is blocked, waiting for the processing to end.
To solve all this, we implemented a brand-new solution (that was invented in…1987): Sagas.
Sagas are a way to solve these general problems:
- Long business processes with multiple steps working in a single database transaction tend to fail.
- Long business processes with multiple steps split across chained API calls between multiple microservices tend to fail.
By splitting a complex business process into multiple steps (like our buying process), a Saga can:
- Handle the long business process with multiple small transactions that can be achieved individually.
- Handle the long business process with multiple API calls not executed in the same runtime execution.
Both leverage the concept of eventual consistency to execute individual steps and trigger others asynchronously. This means that we can quickly capture the user’s intention (what items are being bought), free the user in the Web application to let them take other actions, and handle the buying business process later, asynchronously.
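The “capture the intention, free the user” idea can be shown in a few lines. This is a deliberately simplified sketch (an in-memory queue stands in for a real durable store or MQ system; all names are invented):

```python
import queue

# Sketch: the submit path only records the user's intention; the long
# business process runs later, from a separate worker. The in-memory
# queue is a stand-in for a durable store or message broker.

orders = queue.Queue()

def submit_order(client_id, items):
    # Fast path: capture the intention and free the user immediately.
    order = {"client": client_id, "items": items, "status": "accepted"}
    orders.put(order)
    return order["status"]

def process_next_order():
    # Slow path: a background worker picks up the order asynchronously
    # and runs the buying process step by step (elided here).
    order = orders.get()
    order["status"] = "completed"
    return order
```

The user only ever waits for `submit_order`; everything expensive happens in `process_next_order`, on the system’s own schedule.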
One of the major pitfalls with this approach is the order validation process. We cannot allow the user to submit an order, let the user go, and crash later because something is fundamentally wrong with the order (e.g., if two order items are mutually exclusive, or if the client asked for a quantity exceeding the maximum). It becomes critical to properly validate the order upfront because we cannot easily recover from an invalid one (compared to the synchronous pattern, where the user sees their order crash live). It would mean contacting the user (maybe they are not even logged into the application anymore), explaining what is wrong, telling them what corrections can be made, and so on. Even with an automated process, this is a lot of work. Thanks to the Saga strategy, all other errors can be handled relatively easily, without the user even noticing anything. For example, if an external API is down, the system can automatically retry later.
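An upfront validation step for the two example rules above might look like this. The rules, limits, and item names are purely illustrative:

```python
# Hedged sketch of upfront order validation: reject anything we cannot
# recover from asynchronously *before* accepting the order. The catalog
# limits and exclusivity rules below are invented for the example.

MAX_QUANTITY = {"o365-mailbox": 300}
MUTUALLY_EXCLUSIVE = [frozenset({"plan-basic", "plan-premium"})]

def validate_order(items):
    """Return a list of human-readable errors; empty means the order is valid."""
    errors = []
    # Rule 1: no item may exceed its maximum quantity.
    for item_id, qty in items.items():
        limit = MAX_QUANTITY.get(item_id)
        if limit is not None and qty > limit:
            errors.append(f"{item_id}: quantity {qty} exceeds maximum {limit}")
    # Rule 2: mutually exclusive items cannot appear in the same order.
    for pair in MUTUALLY_EXCLUSIVE:
        if pair <= items.keys():
            errors.append(f"items {sorted(pair)} are mutually exclusive")
    return errors
```

Only an order with an empty error list gets accepted into the Saga; everything the user could get wrong is caught while they are still there to fix it.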
We implemented our Saga with the Orchestration Pattern (aka State Machine), as opposed to the Choreography Pattern (aka Event Chaining). The Choreography Pattern tends to produce a system that is hard to understand and hard to track, even if people naturally follow this path because it is easier to implement.
With the Orchestration Pattern, our business process is well defined in a single aggregate root, meaning each step is well integrated in a single boundary. It is easy to obtain metrics like the average time required for a step, to see which business processes are currently halted, and to know what the compensating actions for each step are; even unit testing each step and the whole business process is easy. Accomplishing that with the Choreography Pattern in an effective way is simply impossible.
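A toy orchestrator makes the single-boundary idea concrete: the state machine owns the step order and each step’s compensating action. This is a generic sketch of the Orchestration Pattern, not our production implementation; the step names are invented:

```python
# A toy saga orchestrator: one object knows every step, its order, and
# its compensating action, so the whole process lives in one boundary.
# Names and steps are illustrative only.

class SagaStep:
    def __init__(self, name, action, compensation=None):
        self.name = name
        self.action = action
        self.compensation = compensation

class Orchestrator:
    def __init__(self, steps):
        self.steps = steps

    def run(self, ctx):
        done = []
        for step in self.steps:
            try:
                step.action(ctx)
                done.append(step)
            except Exception:
                # A step failed: run compensations in reverse order.
                for prev in reversed(done):
                    if prev.compensation:
                        prev.compensation(ctx)
                return "compensated"
        return "completed"

# Example run: delivery fails, so billing gets compensated.
log = []

def bill(ctx): log.append("billed")
def refund(ctx): log.append("refunded")
def deliver(ctx): raise RuntimeError("external provider is down")

saga = Orchestrator([SagaStep("billing", bill, refund),
                     SagaStep("delivery", deliver)])
result = saga.run({})
```

Because the orchestrator holds the whole step list, timing metrics, halted-process reports, and per-step unit tests all have one obvious place to live.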
Of course, building this whole system is not free: it assumes that we have a well-integrated message queueing (MQ) system. For many years, we used a strategy where we hooked the MQ sending process to the database transaction completion, meaning that during a use case, sending an MQ message in fact accumulated the message in memory until the transaction was complete. Then, the MQ messages were sent to the MQ broker. It was not perfect because, if there was a problem with the sending process to the broker at the very last moment, MQ messages could be lost forever. But it worked well for us for many years; we handled millions of messages that way.
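That earlier approach can be sketched as a small buffer flushed on commit. This is an invented illustration (a plain list stands in for the real MQ client), and it deliberately shows the flaw described above:

```python
# Sketch of the old strategy: messages accumulate in memory during the
# use case and are only sent to the broker once the database transaction
# completes. A list stands in for the real MQ client; names are invented.

class TransactionalSender:
    def __init__(self, broker):
        self.broker = broker   # stand-in for the MQ broker connection
        self.pending = []

    def send(self, message):
        # Buffered in memory, not yet on the wire.
        self.pending.append(message)

    def on_commit(self):
        # Flush only after the database transaction has completed.
        try:
            for message in self.pending:
                self.broker.append(message)
        finally:
            # If the flush itself fails, these messages are lost forever:
            # the database transaction already committed without them.
            self.pending.clear()
```

The weakness is the window in `on_commit`: the database state is durable, but the messages about it are not.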
But when we introduced our Saga system to handle the buying process, we also started to use the recommended “Message Outbox” pattern. Instead of sending MQ messages during a use case, we save them in a dedicated events table in the database at the same time we persist our modified aggregate root, all atomically. No more external API calls to an MQ broker during our business use case. This was a major shift compared to our previous approach. All you need is an internal system processing those persisted messages by sending them to the MQ broker. And since those MQ messages no longer work in a “fire and forget” manner because they have been persisted in the database, our debugging and investigation sessions are far faster and easier. Even if it is not exactly event sourcing, this system gives us similar benefits. In fact, at this point, embracing the event sourcing paradigm is just a step away.
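The outbox pattern itself fits in a short sketch. Here SQLite stands in for the real database and a list for the MQ broker; the schema and names are invented for the example, but the key property is real: the aggregate change and the outgoing event commit in the same transaction.

```python
import json
import sqlite3

# Sketch of the Message Outbox pattern. SQLite stands in for the real
# database; a list stands in for the MQ broker. Schema and names are
# illustrative.

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY,
                         payload TEXT,
                         dispatched INTEGER DEFAULT 0);
""")

def place_order(order_id):
    # One transaction: the state change and its event commit atomically,
    # or neither does. No broker call happens inside the use case.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'accepted')", (order_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"type": "OrderPlaced", "order_id": order_id}),))

broker = []  # stand-in for the MQ broker

def dispatch_outbox():
    # A background relay sends persisted events to the broker, then marks
    # them dispatched. If the relay dies mid-way, undispatched rows remain
    # in the table and are retried later: nothing is lost.
    rows = db.execute("SELECT id, payload FROM outbox WHERE dispatched = 0")
    for row_id, payload in rows.fetchall():
        broker.append(json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET dispatched = 1 WHERE id = ?", (row_id,))
```

Because every event survives in the `outbox` table, a debugging session can simply query it instead of chasing fire-and-forget messages.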
I had the chance to be part of all the different milestones from day 1 (more than 10 years ago, when I started my career!). In those years, I made some bad infrastructure decisions and implementation mistakes while experimenting and learning with those strategies. It is part of the process, I guess. But it gave me the opportunity to build solid experience by learning from my errors and fixing them. Today all this seems obvious, but I hope this blog post can accelerate your learning process!