Temporal and Architecture Shift at Grandeur Backend
An insight into Temporal and how we used it for the architecture shift at the Grandeur backend.
We recently made an architecture shift from monolith to microservices at Grandeur. You must be wondering, “Oh, another company moving to microservices just to flex their tech muscle.” If you think like that, you're not totally wrong. This architecture shift is a precursor to a whole bunch of new features at Grandeur, and we're not going to be humble about them.
Why did we move to microservices? Why wasn't a monolith architecture enough for us? There's a whole debate about monoliths versus microservices, and it is cliché by now to post about it in a few bullet points with no real-world use case. Lucky for you, I've got you covered. I'm not gonna tell you about microservices and monoliths in this post, there's already a whole bunch of content online. What I am gonna tell you about are the reasons that pushed us into the realm of microservices and how we implemented the shift using Temporal.
First, let's talk about MQTT. What's the Firebase of IoT without an MQTT integration? If you know your IoT, you know IoT lives and breathes MQTT, so there was no question that we had to build the integration. However, it's not so simple (like most things in life). As soon as we got our heads down and started working on this feature, we realised the monolith backend was getting bloated. With so many services packed into it, failure-proofing the system was becoming a nightmare.
How We Actually Did It
Remember I told you you're not totally wrong about us flexing our tech muscle? Before we even thought about using Temporal, we decided to build our own microservices orchestration client called Fusion. I could write a whole blog about how we built Fusion, but just to scratch the surface: services ran in their respective pods, and the Fusion server used the Pub/Sub model and Redis queues to pass messages between the Fusion client and the microservices. It worked perfectly for our use case, but then we realised we needed to add optional reliability and error logging. Remember, I told you we were building this from scratch! Fusion started as a project to flex our muscles, but it soon became technically taxing on us. We were spending more time figuring things out with Fusion than working on our core features. To cut our tech tax we started looking for alternatives, and we struck gold with Temporal. We have since successfully made the shift from Fusion to Temporal, and we couldn't be happier about it.
What is Temporal? Well, in their own words, “Temporal is a scalable and reliable runtime for Reentrant Processes called Temporal Workflow Executions”. Too much technical jargon, right? In layman's terms, think of a Workflow as your service, for example user verification, and Temporal as the communication medium between your main application and the service. But don't think of Temporal as a glorified message-passing pipeline. There's a lot more to it.
Deep Dive into Temporal
Oh my, oh my, understanding a new technology is always a steep learning curve, and in this case, for someone new to microservices and the associated jargon, it took quite a bit of going through documentation to finally understand how Temporal works. Good thing, then, that since I've already gone through the pain of reading the documentation, I'll give you my understanding of Temporal for the fair price of reading this blog.
We're gonna understand Temporal in the following order, which I believe is the most logical for someone who's new to distributed architecture:
- Temporal Platform
- Workflows
- Activities
- Workers
1. Temporal Platform
In Temporal's own words, “The Temporal Platform consists of a Temporal Cluster and Worker Processes. Together these components create a runtime for Workflow Executions.” In simpler words, a Temporal Cluster consists of a server and a persistent database. The server acts as a bridge between the user application and the microservices, while the database stores the following:
- Tasks to be dispatched.
- Workflow state and history.
- Namespace data and visibility data.
In my experience I had the most difficulty understanding the Temporal server, so I'll go a bit deeper and explain it in more detail for my past self. Think of it as an orchestra conductor. It receives requests from a user application (a Temporal client). These requests can signal a workflow to start or stop, query past executions, and so on. Let's take the most common case, where a client requests to execute a workflow. In the simplest case, the request to the server contains the id of the workflow to be executed and its input arguments (think of a workflow as the service to be executed for now). The server maintains a number of task queues, each responsible for maintaining a list of tasks to be executed. To keep it simple for now, whenever a client requests to execute a workflow, a task is added to a queue. A worker process (a program that actually executes the code and is external to the server) continuously polls the task queue for new workflows to execute. When a new workflow appears on the queue, the worker dequeues it and executes the service. It then returns the result to the server, and the server persists the result and returns it to the client.
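To make that dispatch loop concrete, here's a toy model of it in plain Python. This is not the Temporal SDK, and the names (`client_start_workflow`, `worker_poll_once`, the `greet` workflow) are all made up for illustration; it only sketches the control flow described above.

```python
import queue

# Toy model of a Temporal-style dispatch loop: the "server" enqueues
# workflow tasks, a "worker" polls the queue, runs the registered
# workflow function, and hands the result back. Real Temporal does this
# over gRPC with persistence; this models only the control flow.

task_queue = queue.Queue()

# Registered workflows: workflow id -> function (illustrative names).
registry = {
    "greet": lambda name: f"Hello, {name}!",
}

def client_start_workflow(workflow_id, arg):
    """The client asks the server to run a workflow: a task is enqueued."""
    task_queue.put((workflow_id, arg))

def worker_poll_once():
    """The worker dequeues one task, executes it, and returns the result."""
    workflow_id, arg = task_queue.get()
    result = registry[workflow_id](arg)
    return result  # the server would persist this and reply to the client

client_start_workflow("greet", "Grandeur")
print(worker_poll_once())  # -> Hello, Grandeur!
```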
2. Workflows
In their simplest form, workflows are a sequence of steps defined by your business logic. These steps are defined in code and are required to be deterministic. A workflow is a function written in the programming language you love, and it must be registered with a worker process for it to be executed. Temporal is designed so that a workflow can be paused for an arbitrarily long time during its execution. In this case the state of the workflow execution is persisted in the Temporal Cluster. Whenever it is time to resume the execution of the workflow, the previous state is fetched from the cluster and the worker process continues executing the workflow from where it left off. So, if the code is non-deterministic, it won't be able to resume from the point it paused at.
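Here's a toy sketch of why determinism matters for resuming. Temporal resumes a paused workflow by re-running its code and feeding back recorded results from history, so the code must make the same calls in the same order every time. This generator-based model is my own illustration, not how the SDK is implemented:

```python
# Toy replay sketch: a "workflow" yields activity requests; the runner
# feeds back recorded results from history instead of re-executing them.
# Replay only reaches the same end state if the workflow is deterministic.

def run_workflow(workflow_fn, history):
    """Drive a generator-based 'workflow', replaying recorded results."""
    gen = workflow_fn()
    result = None
    step = 0
    try:
        while True:
            request = gen.send(result) if step else next(gen)
            if step < len(history):
                result = history[step]          # replayed from history
            else:
                result = f"executed:{request}"  # fresh activity execution
                history.append(result)          # persist for future replays
            step += 1
    except StopIteration as done:
        return done.value

def my_workflow():
    a = yield "activity_one"  # deterministic sequence of activity calls
    b = yield "activity_two"
    return [a, b]

history = []
first = run_workflow(my_workflow, history)     # runs both activities fresh
replayed = run_workflow(my_workflow, history)  # pure replay from history
print(first == replayed)  # -> True
```

If `my_workflow` used, say, a random number to decide which activity to call, the replay would diverge from the recorded history, which is exactly the failure mode the determinism rule prevents.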
Now, you must be wondering whether the deterministic nature of workflows is a headache, since determinism forbids the user code from accessing anything that returns varying results, for example the file system or random number generation. To mitigate this issue we have Activities, which are called by a workflow to perform the parts of the business logic that may be non-deterministic.
With the idea of activities in mind, the deterministic nature of workflows is easy to understand with an example. Think of a user background check workflow. In this workflow we have to use an external API that checks with the relevant agencies and returns a boolean value, true or false. Assume that the API is really slow to respond and can take several minutes to perform the necessary background verifications. In such cases the worker process that was executing the workflow may start working on another workflow execution, and the state of the current workflow is persisted in the cluster. When the worker process finally comes around to executing this workflow again, the state is resumed from where it left off rather than restarted.
3. Activities
All the non-deterministic, failure-prone parts of your business logic are placed inside an activity. This is the simplest way to understand an activity. Another way to think of microservices in the Temporalverse is this: take a service that does X and break it down into two components. The failure-prone component is placed inside an activity, and the deterministic part of your code is placed inside the workflow. Moreover, an activity is also a function that, like a workflow, must be registered with a worker for it to be executed.
Another differentiation between a workflow and an activity is the retryable nature of activities. If for some reason an activity fails, it can be retried, in theory an infinite number of times. This allows only part of the code to fail without forcing the whole workflow execution to restart from line 1.
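A minimal sketch of that retry behaviour, again in plain Python with made-up names (real Temporal retry policies also support things like backoff intervals and non-retryable error types):

```python
# Toy retry policy: a failing activity is retried without touching the
# surrounding workflow's state. Illustrative only, not the Temporal SDK.

def run_activity_with_retry(activity_fn, max_attempts=3):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return activity_fn(attempt)
        except Exception as err:  # activity failed; only the activity retries
            last_error = err
    raise last_error              # give up after max_attempts

def flaky_activity(attempt):
    """Fails on the first two attempts, succeeds on the third."""
    if attempt < 3:
        raise RuntimeError(f"transient failure on attempt {attempt}")
    return "ok"

print(run_activity_with_retry(flaky_activity))  # -> ok
```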
To give you an example, think of a user verification workflow. To verify a user, you have to match their credentials with the ones found in the database. Since accessing a database is not allowed in a workflow due to the determinism constraints, you create an activity, let's say ‘verify user’, and use it to return a boolean value of true or false depending on whether verification succeeded. The database access, record filtering, credential matching and whatnot are all placed inside the activity function.
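Here's how that split looks in a toy version. The in-memory "database", the function names and the credential fields are all invented for illustration; in real Temporal code the workflow would invoke the activity through the SDK rather than a plain function call.

```python
# Toy split of a user-verification service: the non-deterministic I/O
# (database access) lives in the activity, the deterministic decision
# logic lives in the workflow. Plain Python, not the Temporal SDK.

FAKE_USER_DB = {"alice": "s3cret"}  # stands in for a real database

def verify_user_activity(username, password):
    """Activity: all the failure-prone database work goes here."""
    stored = FAKE_USER_DB.get(username)
    return stored is not None and stored == password

def user_verification_workflow(username, password, call_activity):
    """Workflow: deterministic logic only; I/O is delegated to the activity."""
    verified = call_activity(username, password)
    return "verified" if verified else "rejected"

result = user_verification_workflow("alice", "s3cret", verify_user_activity)
print(result)  # -> verified
```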
4. Workers
The Worker does the actual work. Although there is a lot more to it, keeping this basic idea in mind will help you understand Workers in Temporal. A worker is a process that continuously polls a task queue, dequeues a task, executes it (the task can be a workflow or an activity) and returns the result. Going a bit deeper, a worker process runs separately from the Temporal Cluster, and developers are responsible for maintaining the worker nodes. A single worker can listen on a number of task queues, and whenever a new task appears, it starts executing it. Now, since we have two types of tasks, workflows and activities, the question must have come to your mind whether workflows and activities are executed separately or together. From the outside, both seem to run in a single go with no context switch. From Temporal's perspective, however, workflow execution and activity execution are different things.
Whenever an activity is spawned from a workflow, or a workflow is about to be executed, a new task is added to the task queue. It is about time to tell you that there is not just a single task queue. There can be an arbitrary number of task queues, and each task queue is divided into two sub-queues: a workflow task queue and an activity task queue. Moreover, a worker entity (an individual Worker within a Worker Process that listens to a specific Task Queue) is also subdivided into two, a workflow worker and an activity worker, so that workflow execution and activity execution can make progress at the same time. Workers are also stateless, so a single worker can handle an arbitrary number of workflows and activities, and a workflow that begins execution on one worker can potentially end its execution on another.
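The two-sub-queue split can be sketched like this. Everything here (queue names, task strings) is a made-up model of the idea, not how Temporal's workers are actually written; it only shows workflow tasks and activity tasks being drained independently:

```python
import queue

# Toy worker entity with two sub-queues, mirroring how a Temporal task
# queue is split into a workflow task queue and an activity task queue.
# Illustrative only; real workers poll the cluster over gRPC.

workflow_tasks = queue.Queue()
activity_tasks = queue.Queue()

def workflow_worker():
    """Drains workflow tasks; each workflow schedules one activity."""
    results = []
    while not workflow_tasks.empty():
        name = workflow_tasks.get()
        activity_tasks.put(f"{name}:activity")  # workflow schedules an activity
        results.append(f"workflow {name} progressed")
    return results

def activity_worker():
    """Drains activity tasks independently of workflow progress."""
    results = []
    while not activity_tasks.empty():
        results.append(f"ran {activity_tasks.get()}")
    return results

workflow_tasks.put("verify_user")
print(workflow_worker())  # -> ['workflow verify_user progressed']
print(activity_worker())  # -> ['ran verify_user:activity']
```

Because both workers are stateless loops over shared queues, nothing ties a given workflow or activity to a particular worker, which is the property that lets a workflow start on one worker and finish on another.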