This post is about my collective experience in distributed systems as Developer, Architect, Systems Engineer, SRE and Resiliency Lead. With this many hats, worked in different area to run the systems in a more resilient on Public and Private Cloud.
In this journey there are lot of failures and success in running Distributed Systems irrespective of Cloud or On-Prem.
The Strategy of Creating and Running Distributed Systems in on-prem and cloud don’t have much differences. In cloud you are indirectly responsible from maintenance perspective but in end you are 100% responsible for your business SLA.
Because the cloud outages always teach us about the importance of hybrid systems.
Distributed systems are interesting because it explodes your assumptions interestingly and push you to learn more and you do not get the guarantee in a single process which support minimal business loads.
Idempotency! Idempotency!
Idempotent and RPC
Remote Procedure Call (RPC) is a powerful technique for constructing distributed, client-server-based applications
The RPC handler to be idempotent because it may receive your request multiple times. when you respond to the clients, it is not mandated to receive your response. so it’s also idempotent to handle retry
A message handler to be idempotent to handle the same message multiple times.
If your database or table supports data streams like dynamodb place the purchase order, and in the background listen to committed changes and send the email there. So we can make email sending as idempotent.
RPCs are often more expensive than local function calls. Always make RPCs in parallel. RPCs have a higher overhead than function calls. Put multiple items in a single request. But don’t create single cluster of RPC requests when you expect consistent latency because RPCs define limits.
Retry Storm
Always you should control and monitor the rate of requests you pull/send to your upstream/downstream to avoid retry storms
Prioritize Client Side throttle like Server Side throttle
Race conditions
In Queue request sent/received may not be in the same order assuming will lead to race conditions.
A state or reference you read from another system/service/database isn't necessarily the same when you write it back. To avoid the race conditions in this scenario atomically verify the state is still same via ConditionChecks/transactions.
Clocks on different machines may show different times. Don't rely on timestamp comparison to get it 100% correct else it will end to race conditions.
The race conditions can be addressed by a synchronization mechanism, and in most distributed systems it is your database that supports atomic condition checks / transactions.
Don’t make mutations in two systems atomically, e.g. Inserting a purchase order in the database and sending a confirmation email is not atomic. . Because you can mutate purchase order system and it fail on email.
When calling a stateful dependency, such as a distributed database, different items in a batch might be processed by different nodes. Always Shard the batch before forming batches by sorting items.
With a single process, if it crashes there isn't much you can do, so you don't have to bother.
In a distributed system your service should survive when a single process/machine fails, and try to fail gracefully if your dependency service is hard-down.
When one process fails, the system should somehow notice that (failed health checks, expired leases, explicit acknowledgments of completed work) and redistribute the work among survivors.
Distributed State
With a single process, if it crashes there isn't much you can do, so you don't have to bother.
In a distributed system your service should need to survive even when a single process/machine fails(Failure is inevitable in distributed systems), and try to fail gracefully if your dependency service is hard-down.
When one process fails, the system should somehow notice that (failed health checks, expired leases, explicit acknowledgments of completed work) and redistribute the work among other survivors gracefully.
State replicas should live in different availability zones which fail independently. Distribute copies even in geographically distant locations to survive natural disasters, and for business continuity.
The most advantage of distributed database is it takes care of state replication and consistency in most systems.
Throttle
A single process can rarely overwhelm itself. In a distributed system, one service can take another down by sending too many requests. So rate limiter and circuit breaker patterns are default in the system. The good news is most of the sidecar systems and API Gateway by default have these features now
A service must protect itself by throttling incoming requests, and you sometimes should protect others by throttling outgoing requests.
Security
Different parts of the same process trust each other. A distributed system should minimize the impact when one node is compromised: each node may have to verify each incoming call/connection via authentication, authorization, transport encryption.
Testing
Testing a distributed system is harder. It may require multiple machines and skilled resources. Distributed systems always require nonfunctional testing in addition to functional. Chaos Engineering is the term now largely adopted by more companies to instrument failures and validate the consistency of Fallback and Recovery mechanisms.
Distributed System adoption is always drive to meet the business needs and governance requirements. Business and Technical teams to be aligned well to make the right technical growth.
Comments