In a microservices architecture, services talk to each other over the network, using gRPC, HTTP, or event-driven mechanisms. One major difference between in-memory calls and network (remote) calls is that remote calls can fail, and the remote server may not respond at all until a timeout expires. Imagine someone placing an order on an eCommerce site while the payment service is failing or not responding: the site may well lose that customer. As the Amazon.com CTO said, "Everything fails all of the time", so you need proper design patterns in place to handle API call failures. This is where the Circuit Breaker Pattern comes in; in Spring Cloud we will use Resilience4j to implement it. The basic idea is to stop flooding a failing service with retry requests and instead give it some time to recover.
1. What is a Circuit Breaker Pattern?
The Circuit Breaker Pattern works much like an electrical circuit breaker. When the number of consecutive failures crosses a certain threshold, the circuit breaker trips, and no further calls are made to the remote server for a specified duration. The remote service can use this time to recover or restart. Once the duration has elapsed, the circuit breaker lets a few trial requests through to see whether they succeed. If they do, it resumes forwarding requests to the remote service. If they fail, it waits again for the specified duration.
The Circuit Breaker Pattern helps prevent cascading request failures when a remote service is down.
A circuit breaker can be in one of three states:
- Closed: The remote service is working as expected. No short-circuiting is required.
- Open: The remote service is not responding as expected; it may be down or frozen. All requests are short-circuited. The breaker enters this state only after a specified number of failures have occurred.
- Half-Open: A limited number of trial requests are let through and their outcomes are recorded. This checks whether the remote server is back online and working as expected: if the trial calls succeed, the breaker closes again; if they fail, it goes back to Open.
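The three states can be sketched as a small state machine. This is a minimal illustration, not how Resilience4j implements it: the class name `SimpleCircuitBreaker`, its parameters, and the explicit `nowMillis` argument (used instead of a real clock so the behaviour is easy to follow) are all assumptions made for this example, and the class is not thread-safe.

```java
import java.util.function.Supplier;

// Hypothetical single-threaded sketch of the Closed / Open / Half-Open states.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures before tripping
    private final long openMillis;      // how long to stay OPEN before a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    State state() {
        return state;
    }

    // nowMillis is passed in by the caller (an assumption of this sketch).
    <T> T call(Supplier<T> remoteCall, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < openMillis) {
                // Still inside the recovery window: fail fast without calling the service.
                throw new IllegalStateException("circuit open: call short-circuited");
            }
            state = State.HALF_OPEN; // wait is over: allow a trial call
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;
            state = State.CLOSED; // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            // A failed trial call, or too many failures in a row, opens the circuit.
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = nowMillis;
            }
            throw e;
        }
    }
}
```

With a threshold of 2 and a 1000 ms wait, two failing calls open the circuit, a call at 500 ms is rejected without reaching the service, and a successful call at 1500 ms closes the circuit again.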
2. Challenges and considerations:
- Exception handling: There must be proper exception handling for requests that fail through the circuit breaker. Depending on the business logic, you may want to invoke an alternative API when a specific request fails.
- Clear separation of exception types: A service can fail for several reasons, e.g. it cannot process requests because it is overloaded. The circuit breaker should be able to identify the cause and behave accordingly.
- Logging: Adequate logging should be in place even when the circuit is in the half-open or open state, so that the administrator can tune the failure and success threshold values.
- Concurrency: A single circuit breaker instance can be accessed by multiple concurrent requests of an application. It must process those concurrent requests and handle any failures safely.
- Trip immediately when required: Occasionally an error carries enough information for the circuit breaker to trip immediately. E.g. an error indicating that the service is overloaded means an immediate retry is not recommended.
- Replaying failed requests: Rather than simply failing requests quickly, a circuit breaker could also replay the failed requests once the service is back up.
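The points about separating exception types and tripping immediately can be illustrated with a small classifier that the breaker consults on every failure. This is only a sketch: `ServiceOverloadedException`, `FailureClassifier`, and the three actions are hypothetical names made up for this example, and which exceptions belong in which bucket depends on your own services.

```java
// Hypothetical exception a client might throw when the server reports overload.
class ServiceOverloadedException extends RuntimeException {
    ServiceOverloadedException(String message) {
        super(message);
    }
}

// Hypothetical helper: decides how a circuit breaker should treat a failure.
class FailureClassifier {
    enum Action {
        IGNORE,   // do not count towards the failure threshold
        COUNT,    // count as one failure
        TRIP_NOW  // open the circuit immediately
    }

    static Action classify(Throwable error) {
        if (error instanceof ServiceOverloadedException) {
            // The service told us it is overloaded: retrying right away would make it worse.
            return Action.TRIP_NOW;
        }
        if (error instanceof IllegalArgumentException) {
            // Assumed here to be a caller-side mistake, not a sign that the remote service is unhealthy.
            return Action.IGNORE;
        }
        // Network errors, timeouts, and anything unknown count as ordinary failures.
        return Action.COUNT;
    }
}
```

Keeping this decision in one place also gives you a natural spot to log each failure together with the action taken.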
3. Retry vs Circuit Breaker
The Retry pattern is useful for transient failures: failures that are temporary and last only a short time. For simple temporary errors, a retry can make more sense than a more complex Circuit Breaker Pattern. However, choosing the right pattern for a given use case takes some experience.
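A bare-bones retry for transient failures might look like the sketch below. The `SimpleRetry` class and its parameters are assumptions for illustration, not the Resilience4j API; it retries on any `RuntimeException`, whereas a real setup would normally retry only on errors known to be transient.

```java
import java.util.function.Supplier;

// Hypothetical helper: call the supplier up to maxAttempts times with a fixed pause.
class SimpleRetry {
    static <T> T retry(Supplier<T> call, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get(); // success: stop retrying
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(delayMillis); // pause before the next attempt
                }
            }
        }
        throw last; // all attempts failed: surface the last error
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }
}
```

A call such as `SimpleRetry.retry(() -> client.charge(order), 3, 500)` would try up to three times with 500 ms between attempts; `client.charge` here is only a placeholder for your own remote call.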
There are several strategies to decide upon the retry intervals:
- Regular intervals: Retry every 5 seconds, or every 5 seconds plus a random number of milliseconds
- Incremental intervals: Retry after 2, then 3, then 4 seconds, and so on
- Exponential back-off: Retry after 1, 2, 4, 8, 16 seconds, and so on
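The interval strategies above boil down to a few small formulas. The sketch below computes the delay before a given attempt (1-based); the class and method names and the cap on the exponential delay are assumptions added for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helpers that compute the wait before retry number `attempt` (1-based).
class RetryIntervals {
    // Regular interval plus random jitter in [0, maxJitterMillis); maxJitterMillis must be > 0.
    static long fixedWithJitter(long baseMillis, long maxJitterMillis) {
        return baseMillis + ThreadLocalRandom.current().nextLong(maxJitterMillis);
    }

    // Incremental: base, base + step, base + 2 * step, ...
    static long incremental(long baseMillis, long stepMillis, int attempt) {
        return baseMillis + stepMillis * (attempt - 1);
    }

    // Exponential back-off: base, 2 * base, 4 * base, ... never exceeding capMillis.
    static long exponential(long baseMillis, int attempt, long capMillis) {
        int shift = Math.min(attempt - 1, 30); // bound the shift to avoid overflow
        return Math.min(baseMillis * (1L << shift), capMillis);
    }
}
```

For example, `RetryIntervals.exponential(1000, 5, 60000)` gives 16000 ms, matching the 1, 2, 4, 8, 16 second sequence. Adding jitter is useful when many clients retry at once, since it spreads their requests out instead of letting them hit the recovering service at the same moment.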
Libraries like Resilience4j provide both patterns. In most cases it is best to combine the Retry and Circuit Breaker patterns for a more comprehensive approach to handling faults.
4. Implementations of the Circuit Breaker:
- Python circuitbreaker: a Python implementation of the Circuit Breaker Pattern. This seems to be a stable implementation.
- Alternatively, you can also try PyBreaker: a Python implementation of the Circuit Breaker Pattern, as described in Michael T. Nygard's book Release It!.
- Resilience4j: A lightweight, easy-to-use fault tolerance library inspired by Netflix Hystrix, but designed for Java 8 and functional programming. Hystrix is no longer actively maintained.
- Sentinel: A powerful flow control component from Alibaba that enables reliability, resilience, and monitoring for microservices. It offers considerably more features than Resilience4j, but much of the documentation is in Chinese (a translation tool such as Google Translate helps).
You can also implement the complete logic yourself for a small use case. For complex logic, I recommend using a good library that lets you visualize the traces for debugging.
We will use Resilience4j in this tutorial. It provides both the Retry and the Circuit Breaker patterns, and we will apply them carefully to deal with failures. I also recommend trying out Sentinel.