Practice Coding and System Design mock interviews with Senior Software Engineers from Silicon Valley

Tuesday, February 18, 2020

The Circuit Breaker Pattern in System Design and Architecture interviews

Handle faults that might take a variable amount of time to recover from, when connecting to a remote service or resource. This can improve the stability and resiliency of an application. Explaining the Retry pattern together with the Circuit Breaker pattern come up often during the System Design interview. For system design and architecture interviews at Google, Facebook, Uber, Airbnb, etc. being familiar with these patterns is a strong requirement.

Context and problem

In a distributed environment, calls to remote resources and services can fail due to transient faults, such as slow network connections, timeouts, or the resources being overcommitted or temporarily unavailable. These faults typically correct themselves after a short period of time, and a robust cloud application should be prepared to handle them by using a strategy such as the Retry system design pattern.

However, there can also be situations where faults are due to unanticipated events, and that might take much longer to fix. These faults can range in severity from a partial loss of connectivity to the complete failure of a service. In these situations it might be pointless for an application to continually retry an operation that is unlikely to succeed, and instead the application should quickly accept that the operation has failed and handle this failure accordingly.

Additionally, if a service is very busy, failure in one part of the system might lead to cascading failures. For example, an operation that invokes a service could be configured to implement a timeout, and reply with a failure message if the service fails to respond within this period. However, this strategy could cause many concurrent requests to the same operation to be blocked until the timeout period expires. These blocked requests might hold critical system resources such as memory, threads, database connections, and so on. Consequently, these resources could become exhausted, causing failure of other possibly unrelated parts of the system that need to use the same resources. In these situations, it would be preferable for the operation to fail immediately, and only attempt to invoke the service if it's likely to succeed. Note that setting a shorter timeout might help to resolve this problem, but the timeout shouldn't be so short that the operation fails most of the time, even if the request to the service would eventually succeed.

System Design options and tradeoffs

The Circuit Breaker system design pattern can prevent an application from repeatedly trying to execute an operation that's likely to fail. Allowing it to continue without waiting for the fault to be fixed or wasting CPU cycles while it determines that the fault is long lasting. The Circuit Breaker system design pattern also enables an application to detect whether the fault has been resolved. If the problem appears to have been fixed, the application can try to invoke the operation.

The purpose of the Circuit Breaker system design pattern is different than the Retry system design pattern. The Retry pattern enables an application to retry an operation in the expectation that it'll succeed. The Circuit Breaker pattern prevents an application from performing an operation that is likely to fail. In the system design interview we can combine these two patterns by using the Retry pattern to invoke an operation through a circuit breaker. However, the retry logic should be sensitive to any exceptions returned by the circuit breaker and abandon retry attempts if the circuit breaker indicates that a fault is not transient.

A circuit breaker acts as a proxy for operations that might fail. The proxy should monitor the number of recent failures that have occurred, and use this information to decide whether to allow the operation to proceed, or simply return an exception immediately.

The proxy can be implemented as a state machine with the following states that mimic the functionality of an electrical circuit breaker:

  • Closed: The request from the application is routed to the operation. The proxy maintains a count of the number of recent failures, and if the call to the operation is unsuccessful the proxy increments this count. If the number of recent failures exceeds a specified threshold within a given time period, the proxy is placed into the Open state. At this point the proxy starts a timeout timer, and when this timer expires the proxy is placed into the Half-Open state. The purpose of the timeout timer is to give the system time to fix the problem that caused the failure before allowing the application to try to perform the operation again
  • Open: The request from the application fails immediately and an exception is returned to the application.
  • Half-Open: A limited number of requests from the application are allowed to pass through and invoke the operation. If these requests are successful, it's assumed that the fault that was previously causing the failure has been fixed and the circuit breaker switches to the Closed state (the failure counter is reset). If any request fails, the circuit breaker assumes that the fault is still present so it reverts back to the Open state and restarts the timeout timer to give the system a further period of time to recover from the failure.

It is important to discuss the purpose of the Half-Open state during the system design interview. The Half-Open state is useful to prevent a recovering service from suddenly being flooded with requests. As a service recovers, it might be able to support a limited volume of requests until the recovery is complete, but while recovery is in progress a flood of work can cause the service to time out or fail again.

The failure counter used by the Closed state is time based. It's automatically reset at periodic intervals. This helps to prevent the circuit breaker from entering the Open state if it experiences occasional failures. The failure threshold that trips the circuit breaker into the Open state is only reached when a specified number of failures have occurred during a specified interval. The counter used by the Half-Open state records the number of successful attempts to invoke the operation. The circuit breaker reverts to the Closed state after a specified number of consecutive operation invocations have been successful. If any invocation fails, the circuit breaker enters the Open state immediately and the success counter will be reset the next time it enters the Half-Open state.

The Circuit Breaker pattern provides stability while the system recovers from a failure and minimizes the impact on performance. It can help to maintain the response time of the system by quickly rejecting a request for an operation that's likely to fail, rather than waiting for the operation to time out, or never return. If the circuit breaker raises an event each time it changes state, this information can be used to monitor the health of the part of the system protected by the circuit breaker, or to alert an administrator when a circuit breaker trips to the Open state.

The pattern is customizable and can be adapted according to the type of the possible failure and the requirements of the system design interview. For example, you can apply an increasing timeout timer to a circuit breaker. You could place the circuit breaker in the Open state for a few seconds initially, and then if the failure hasn't been resolved increase the timeout to a few minutes, and so on. In some cases, rather than the Open state returning failure and raising an exception, it could be useful to return a default value that is meaningful to the application.

Issues and trade-offs to discuss in the system design interview

You should consider the following points when deciding how to approach this in a system design and architecture interview:

  • Exception Handling. An application invoking an operation through a circuit breaker must be prepared to handle the exceptions raised if the operation is unavailable. The way exceptions are handled will be application specific. For example, an application could temporarily degrade its functionality, invoke an alternative operation to try to perform the same task or obtain the same data, or report the exception to the user and ask them to try again later.
  • Types of Exceptions. A request might fail for many reasons, some of which might indicate a more severe type of failure than others. For example, a request might fail because a remote service has crashed and will take several minutes to recover, or because of a timeout due to the service being temporarily overloaded. A circuit breaker might be able to examine the types of exceptions that occur and adjust its strategy depending on the nature of these exceptions. For example, it might require a larger number of timeout exceptions to trip the circuit breaker to the Open state compared to the number of failures due to the service being completely unavailable.
  • Logging. A circuit breaker should log all failed requests (and possibly successful requests) to enable an administrator to monitor the health of the operation.
  • Recoverability. You should configure the circuit breaker to match the likely recovery pattern of the operation it's protecting. For example, if the circuit breaker remains in the Open state for a long period, it could raise exceptions even if the reason for the failure has been resolved. Similarly, a circuit breaker could fluctuate and reduce the response times of applications if it switches from the Open state to the Half-Open state too quickly.
  • Testing Failed Operations. In the Open state, rather than using a timer to determine when to switch to the Half-Open state, a circuit breaker can instead periodically ping the remote service or resource to determine whether it's become available again. This ping could take the form of an attempt to invoke an operation that had previously failed, or it could use a special operation provided by the remote service specifically for testing the health of the service.
  • Manual Override. In a system where the recovery time for a failing operation is extremely variable, it's beneficial to provide a manual reset option that enables an administrator to close a circuit breaker (and reset the failure counter). Similarly, an administrator could force a circuit breaker into the Open state (and restart the timeout timer) if the operation protected by the circuit breaker is temporarily unavailable.
  • Concurrency. The same circuit breaker could be accessed by a large number of concurrent instances of an application. The implementation shouldn't block concurrent requests or add excessive overhead to each call to an operation.
  • Resource Differentiation. Be careful when using a single circuit breaker for one type of resource if there might be multiple underlying independent providers. For example, in a data store that contains multiple shards, one shard might be fully accessible while another is experiencing a temporary issue. If the error responses in these scenarios are merged, an application might try to access some shards even when failure is highly likely, while access to other shards might be blocked even though it's likely to succeed.
  • Accelerated Circuit Breaking. Sometimes a failure response can contain enough information for the circuit breaker to trip immediately and stay tripped for a minimum amount of time. For example, the error response from a shared resource that's overloaded could indicate that an immediate retry isn't recommended and that the application should instead try again in a few minutes.

When to use this pattern in the system design interview

  • Use this system design pattern:
    • To prevent an application from trying to invoke a remote service or access a shared resource if this operation is highly likely to fail.
  • This system design pattern isn't recommended:
    • For handling access to local private resources in an application, such as in-memory data structure. In this environment, using a circuit breaker would add overhead to your system.
    • As a substitute for handling exceptions in the business logic of your applications.

Discussing the the retry and circuit breaker patterns comes up frequently in system design and architecture interviews at Google, Microsoft, Facebook, Airbnb, Uber, Salesforce, LinkedIn and Amazon.

Tuesday, February 11, 2020

The Retry Pattern in System Design and Architecture interviews

The Retry pattern enables an application to handle transient failures when it tries to connect to a service or network resource, by transparently retrying a failed operation. This can improve the stability of the application. Explaining the Retry pattern together with the Circuit Breaker pattern come up often during the System Design interview. For infrastructure interviews at Google, Facebook, Amazon, etc. being familiar with these patterns is a strong requirement.

Context and problem

An application that communicates with elements running in the cloud has to be sensitive to the transient faults that can occur in this environment. Faults include the momentary loss of network connectivity to components and services, the temporary unavailability of a service, or timeouts that occur when a service is busy.

It is common to be asked to handle such faults during the system design interview.

These faults are typically self-correcting, and if the action that triggered a fault is repeated after a suitable delay it's likely to be successful. For example, a database service that's processing a large number of concurrent requests can implement a throttling strategy that temporarily rejects any further requests until its workload has eased. An application trying to access the database might fail to connect, but if it tries again after a delay it might succeed.

System Design options and tradeoffs

In the cloud, transient faults aren't uncommon and an application should be designed to handle them elegantly and transparently. This minimizes the effects faults can have on the business tasks the application is performing.

If an application detects a failure when it tries to send a request to a remote service, it can handle the failure using the following strategies:

  • Cancel. If the fault indicates that the failure isn't transient or is unlikely to be successful if repeated, the application should cancel the operation and report an exception. For example, an authentication failure caused by providing invalid credentials is not likely to succeed no matter how many times it's attempted.
  • Retry. If the specific fault reported is unusual or rare, it might have been caused by unusual circumstances such as a network packet becoming corrupted while it was being transmitted. In this case, the application could retry the failing request again immediately because the same failure is unlikely to be repeated and the request will probably be successful.
  • Retry after delay. If the fault is caused by one of the more commonplace connectivity or busy failures, the network or service might need a short period while the connectivity issues are corrected or the backlog of work is cleared. The application should wait for a suitable time before retrying the request.

For the more common transient failures, the period between retries should be chosen to spread requests from multiple instances of the application as evenly as possible. This reduces the chance of a busy service continuing to be overloaded. If many instances of an application are continually overwhelming a service with retry requests, it'll take the service longer to recover.

If the request still fails, the application can wait and make another attempt. If necessary, this process can be repeated with increasing delays between retry attempts, until some maximum number of requests have been attempted. The delay can be increased incrementally or exponentially, depending on the type of failure and the probability that it'll be corrected during this time.

The following diagram can be used in a system design whiteboard interview to illustrate invoking an operation in a hosted service using this pattern. If the request is unsuccessful after a predefined number of attempts, the application should treat the fault as an exception and handle it accordingly.

The application should wrap all attempts to access a remote service in code that implements a retry policy matching one of the strategies listed above. Requests sent to different services can be subject to different policies. Some vendors provide libraries that implement retry policies, where the application can specify the maximum number of retries, the time between retry attempts, and other parameters.

An application should log the details of faults and failing operations. This information is useful to operators. If a service is frequently unavailable or busy, it's often because the service has exhausted its resources. You can reduce the frequency of these faults by scaling out the service. For example, if a database service is continually overloaded, it might be beneficial to partition the database and spread the load across multiple servers.

Issues and considerations to discuss in the system design interview

You should consider the following points when deciding how to implement this pattern.

The retry policy should be tuned to match the business requirements of the application and the nature of the failure. For some noncritical operations, it's better to fail fast rather than retry several times and impact the throughput of the application. For example, in an interactive web application accessing a remote service, it's better to fail after a smaller number of retries with only a short delay between retry attempts, and display a suitable message to the user (for example, “please try again later”). For a batch application, it might be more appropriate to increase the number of retry attempts with an exponentially increasing delay between attempts.

An aggressive retry policy with minimal delay between attempts, and a large number of retries, could further degrade a busy service that's running close to or at capacity. This retry policy could also affect the responsiveness of the application if it's continually trying to perform a failing operation.

If a request still fails after a significant number of retries, it's better for the application to prevent further requests going to the same resource and simply report a failure immediately. When the period expires, the application can tentatively allow one or more requests through to see whether they're successful. We will discuss this strategy called the Circuit Breaker system design pattern in a future blog post.

Consider whether the operation is idempotent. If so, it's inherently safe to retry. Otherwise, retries could cause the operation to be executed more than once, with unintended side effects. For example, a service might receive the request, process the request successfully, but fail to send a response. At that point, the retry logic might re-send the request, assuming that the first request wasn't received. For example, creating an account or processing a payment are not idempotent.

A request to a service can fail for a variety of reasons raising different exceptions depending on the nature of the failure. Some exceptions indicate a failure that can be resolved quickly, while others indicate that the failure is longer lasting. It's useful for the retry policy to adjust the time between retry attempts based on the type of the exception.

Consider how retrying an operation that's part of a transaction will affect the overall transaction consistency. Fine tune the retry policy for transactional operations to maximize the chance of success and reduce the need to undo all the transaction steps.

Ensure that all retry code is fully tested against a variety of failure conditions. Check that it doesn't severely impact the performance or reliability of the application, cause excessive load on services and resources, or generate race conditions or bottlenecks.

Implement retry logic only where the full context of a failing operation is understood. For example, if a task that contains a retry policy invokes another task that also contains a retry policy, this extra layer of retries can add long delays to the processing. It might be better to configure the lower-level task to fail fast and report the reason for the failure back to the task that invoked it. This higher-level task can then handle the failure based on its own policy.

It's important to log all connectivity failures that cause a retry so that underlying problems with the application, services, or resources can be identified.

Investigate the faults that are most likely to occur for a service or a resource to discover if they're likely to be long lasting or terminal. If they are, it's better to handle the fault as an exception. The application can report or log the exception, and then try to continue either by invoking an alternative service (if one is available), or by offering degraded functionality. For more information on how to detect and handle long-lasting faults, look up the Circuit Breaker system design pattern.

When to use this in the system design interview

Use this pattern when an application could experience transient faults as it interacts with a remote service or accesses a remote resource. These faults are expected to be short lived, and repeating a request that has previously failed could succeed on a subsequent attempt.

This pattern might not be useful:

  • When a fault is likely to be long lasting, because this can affect the responsiveness of an application. The application might be wasting time and resources trying to repeat a request that's likely to fail.
  • For handling failures that aren't due to transient faults, such as internal exceptions caused by errors in the business logic of an application.
  • As an alternative to addressing scalability issues in a system. If an application experiences frequent busy faults, it's often a sign that the service or resource being accessed should be scaled up.
Discussing the the retry and circuit breaker patterns comes up frequently in system design and architecture interviews at Google, Microsoft, Facebook, Twitter, Uber and Amazon.

Sunday, February 9, 2020

Binary Search

In this blog post we will not explain what binary search is or how it works. Instead, we will focus on how to detect a binary search solution in a coding interview question and how to implement it properly.

When to binary search in a coding interview?

We can use binary search almost every time we deal with a monotonic function. A monotonic function is a function between ordered sets that preserves or reverses the original order.

A function f(x) = y that satisfies f(x') <= f(x'') for any x' < x'' is monotonic. We used "<=" as the order relationship, but we can also use ">=" or a custom defined order relationship.

f(x) = 3x+1 is monotonically increasing.
f(x) = -x+5 is monotonically decreasing.
a sorted array arr[] = {2, 5, 9, 10, 12, 17} is a monotonically increasing discrete function!

We should consider using binary search whenever we are given a sorted input, a timeline, or generally speaking, a monotonic function.

Some coding interview practice questions could look like this:

Given a sorted array of integers A and a random number Q, find out if Q exists in A.
Find the minimum of a given U-shaped function or the peak in a mountain array.

The textbook binary search is flawed

Let's start with the simple problem of finding a certain number Q in the sorted array A. In the programming interview, most candidates will implement something like this:

def find(arr, q):
  left = 0
  right = len(arr)
  while (left < right):
    mid = (left + right) / 2
    if arr[mid] == q:
      return true
    if arr[mid] > q:
      right = mid - 1
      left = mid + 1
  if right >= 0 and arr[right] == q:
    return true
  if arr[left] == q:
    return true
  return false

The problem with this implementation is that at the end of the while loop we don't know precisely if left == right or left > right. We also don't know if we are still within the boundaries of the array. Let's take a few examples:

[left, right] = [3, 4] => mid = 3
we may end up with [left, right] = [3, 2]

[left, right] = [0, 1] => mid = 0
we may end up with [left, right] = [0, -1]

This is where almost every candidate struggles in the programming interview, and, in most cases fails to handle all the possible edge cases.

Lower bound / upper bound

If, instead of asking to find an exact value Q, we now ask for an upper bound or lower bound (nearest value in the array that is >= Q or <= Q) then we end up with even more complex edge cases.

Specifically, when we move the pointers to mid-1 or mid+1, we leave arr[mid] out of the potential solutions space. But arr[mid] may still be a viable solution. In our interviewing experience, almost every attempt the candidate made to address this has resulted in an infinite loop.

Also, once we are out of the while loop, we won't be able to tell whether arr[left], arr[right], or neither of them is the value we are looking for.

The source of all these corner cases comes from the way we move the pointers:

left = mid + 1
right = mid - 1

Let's try to fix this so that we never leave mid out and we never end up with the confusing end state where left > right.

  • First, keep in mind that whenever we move the pointers, our goal is to generate a smaller interval than the original [left, right]. This is obvious when left and right are far apart, but can be confusing when they get very close.
  • Second, the new binary search interval should be roughly half the size of the previous one (to achieve the logarithmic time complexity).
  • Third, notice that the following relationship stands no matter what:

    left <= mid < right
    or even better:
    left <= mid < mid + 1 <= right

So, if we split the [left, right] interval into [left, mid] and [mid+1, right], then we check all 3 conditions from above, and, according to the third condition, it's guaranteed that left <= right always, and left == right at the end!

Let's implement it:

def find(arr, q):
  left = 0
  right = len(arr)
  while (left < right):
    mid = (left + right) / 2
    if arr[mid] <= q:
      right = mid
      left = mid + 1

  # at this point we know that left == right
  return arr[left] == q

Short, clean, without edge cases! Easy to adapt when looking for an upper or lower bound. Precisely what we need in a software engineering coding interview.

Of course, we can still check and return early if we find arr[mid] == q. However, this is an optimistic optimization and only provides a real benefit in particularly lucky situations.

def find(arr, q):
  left = 0
  right = len(arr)
  while (left < right):
    mid = (left + right) / 2
    if arr[mid] == q:
      return true
    if arr[mid] < q:
      right = mid
      left = mid + 1
  return arr[left] == q

Binary search coding interview practice questions

  • Find first or last occurrence of a given number in a sorted array
  • Count occurrences of a number in a sorted array with duplicates
  • Find smallest missing element from a sorted array
  • Search an element in a circular sorted array