Practice Coding and System Design mock interviews with Senior Software Engineers from Silicon Valley

Friday, May 1, 2020

Depth First Search or Breadth First Search?

Graph traversal algorithms are very common in coding interview problems. We'll skip the implementation details as they are quite trivial. In many cases a coding problem can be solved correctly using either DFS or BFS. I then like to ask the candidate why they chose one solution over the other. I expect them to cover the following tradeoffs:

1. Simplicity

Implementing a Depth First Search solution takes fewer lines of code. The recursive implementation takes advantage of the already managed call stack, which leads to clean, readable code. Breadth First Search requires us to handle the queue ourselves. This is not difficult at all if we can use a built-in queue implementation (most programming languages provide one in their standard library).
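
For reference, a minimal sketch of both traversals in Python (assuming graph is a dict mapping each node to a list of neighbors; the helper names are illustrative):

from collections import deque

def dfs(graph, node, visited=None):
  # recursive DFS: the language manages the stack for us
  if visited is None:
    visited = set()
  visited.add(node)
  for neighbor in graph[node]:
    if neighbor not in visited:
      dfs(graph, neighbor, visited)
  return visited

def bfs(graph, start):
  # BFS: we manage the queue ourselves, using the built-in deque
  visited = {start}
  queue = deque([start])
  while queue:
    node = queue.popleft()
    for neighbor in graph[node]:
      if neighbor not in visited:
        visited.add(neighbor)
        queue.append(neighbor)
  return visited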

2. Stack overflow

There's generally a limit on how deep the recursion can go. A recursive DFS implementation can lead to a stack overflow error if the graph has a very large depth. We can, of course, avoid it by implementing the stack manually.
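
For example, a sketch of DFS with a manually managed stack (no recursion, so no stack overflow on deep graphs):

def dfs_iterative(graph, start):
  visited = set()
  stack = [start]
  while stack:
    node = stack.pop()
    if node in visited:
      continue
    visited.add(node)
    for neighbor in graph[node]:
      if neighbor not in visited:
        stack.append(neighbor)
  return visited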

3. Parallelism

Breadth First Search can be parallelized easily. This is particularly important when the exploration of each node is far more expensive than O(1). It is well suited to large scale system design problems, which you can implement using a distributed event queue and a pool of workers. Some great real-world examples (also commonly asked system design interview questions) include a web crawler service or a large-scale image processing algorithm. Resource locking may be needed if the graph is not a tree.
There is no easy way to parallelize a Depth First Search. Even if the graph is a tree, it is still difficult to come up with a good strategy for when to spin up new threads/processes and how to divide the workload.
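
As an illustration, here is a minimal sketch of a level-by-level parallel BFS using a thread pool (get_neighbors stands in for an expensive per-node operation, such as fetching and parsing a web page; in a distributed version the pool would be replaced by an event queue and workers):

from concurrent.futures import ThreadPoolExecutor

def parallel_bfs(start, get_neighbors, num_workers=8):
  visited = {start}
  frontier = [start]
  with ThreadPoolExecutor(max_workers=num_workers) as pool:
    while frontier:
      # expand the whole frontier in parallel
      neighbor_lists = list(pool.map(get_neighbors, frontier))
      next_frontier = []
      for neighbors in neighbor_lists:
        for node in neighbors:
          # visited is only touched from this thread, so no locking is needed here
          if node not in visited:
            visited.add(node)
            next_frontier.append(node)
      frontier = next_frontier
  return visited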

4. Ability to abort early

Let's assume our task is to find a node with a certain property. Once found, we can stop the traversal and return. Based on our knowledge of the graph shape and/or the property we are looking for, we may be able to decide which traversal algorithm requires visiting fewer nodes. Let's assume, for example, that we are looking for the first/any string that matches a prefix in a trie. DFS will find one after visiting a minimal number of nodes, while BFS will end up exploring almost the entire subtree.



This kind of comparison comes up a lot in the Google coding interview process. They also have a large collection of interview problems involving simple graph traversals.


Tuesday, February 18, 2020

The Circuit Breaker Pattern in System Design and Architecture interviews

The Circuit Breaker pattern handles faults that might take a variable amount of time to recover from when connecting to a remote service or resource. This can improve the stability and resiliency of an application. Explaining the Retry pattern together with the Circuit Breaker pattern comes up often during the System Design interview. For system design and architecture interviews at Google, Facebook, Uber, Airbnb, etc., being familiar with these patterns is a strong requirement.

Context and problem


In a distributed environment, calls to remote resources and services can fail due to transient faults, such as slow network connections, timeouts, or the resources being overcommitted or temporarily unavailable. These faults typically correct themselves after a short period of time, and a robust cloud application should be prepared to handle them by using a strategy such as the Retry system design pattern.

However, there can also be situations where faults are due to unanticipated events, and that might take much longer to fix. These faults can range in severity from a partial loss of connectivity to the complete failure of a service. In these situations it might be pointless for an application to continually retry an operation that is unlikely to succeed, and instead the application should quickly accept that the operation has failed and handle this failure accordingly.

Additionally, if a service is very busy, failure in one part of the system might lead to cascading failures. For example, an operation that invokes a service could be configured to implement a timeout, and reply with a failure message if the service fails to respond within this period. However, this strategy could cause many concurrent requests to the same operation to be blocked until the timeout period expires. These blocked requests might hold critical system resources such as memory, threads, database connections, and so on. Consequently, these resources could become exhausted, causing failure of other possibly unrelated parts of the system that need to use the same resources. In these situations, it would be preferable for the operation to fail immediately, and only attempt to invoke the service if it's likely to succeed. Note that setting a shorter timeout might help to resolve this problem, but the timeout shouldn't be so short that the operation fails most of the time, even if the request to the service would eventually succeed.

System Design options and tradeoffs


The Circuit Breaker system design pattern can prevent an application from repeatedly trying to execute an operation that's likely to fail, allowing it to continue without waiting for the fault to be fixed or wasting CPU cycles while it determines that the fault is long lasting. The Circuit Breaker system design pattern also enables an application to detect whether the fault has been resolved. If the problem appears to have been fixed, the application can try to invoke the operation.

The purpose of the Circuit Breaker system design pattern is different from that of the Retry system design pattern. The Retry pattern enables an application to retry an operation in the expectation that it'll succeed. The Circuit Breaker pattern prevents an application from performing an operation that is likely to fail. In the system design interview we can combine these two patterns by using the Retry pattern to invoke an operation through a circuit breaker. However, the retry logic should be sensitive to any exceptions returned by the circuit breaker and abandon retry attempts if the circuit breaker indicates that a fault is not transient.

A circuit breaker acts as a proxy for operations that might fail. The proxy should monitor the number of recent failures that have occurred, and use this information to decide whether to allow the operation to proceed, or simply return an exception immediately.

The proxy can be implemented as a state machine with the following states that mimic the functionality of an electrical circuit breaker:


  • Closed: The request from the application is routed to the operation. The proxy maintains a count of the number of recent failures, and if the call to the operation is unsuccessful the proxy increments this count. If the number of recent failures exceeds a specified threshold within a given time period, the proxy is placed into the Open state. At this point the proxy starts a timeout timer, and when this timer expires the proxy is placed into the Half-Open state. The purpose of the timeout timer is to give the system time to fix the problem that caused the failure before allowing the application to try to perform the operation again.
  • Open: The request from the application fails immediately and an exception is returned to the application.
  • Half-Open: A limited number of requests from the application are allowed to pass through and invoke the operation. If these requests are successful, it's assumed that the fault that was previously causing the failure has been fixed and the circuit breaker switches to the Closed state (the failure counter is reset). If any request fails, the circuit breaker assumes that the fault is still present so it reverts back to the Open state and restarts the timeout timer to give the system a further period of time to recover from the failure.
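
As a concrete reference, here is a minimal single-threaded sketch of this state machine in Python. The class name, the thresholds, and the simplified Half-Open rule (a single successful probe closes the circuit) are illustrative assumptions, not a canonical implementation:

import time

class CircuitBreaker:
  def __init__(self, failure_threshold=5, recovery_timeout=30.0):
    self.failure_threshold = failure_threshold  # failures that trip the breaker
    self.recovery_timeout = recovery_timeout    # seconds to stay Open
    self.failure_count = 0
    self.opened_at = None
    self.state = "CLOSED"

  def call(self, operation, *args, **kwargs):
    if self.state == "OPEN":
      if time.time() - self.opened_at >= self.recovery_timeout:
        self.state = "HALF_OPEN"  # timeout expired: let a probe request through
      else:
        raise RuntimeError("circuit open: failing fast")
    try:
      result = operation(*args, **kwargs)
    except Exception:
      self._on_failure()
      raise
    self._on_success()
    return result

  def _on_failure(self):
    self.failure_count += 1
    # any failure in Half-Open, or too many failures in Closed, opens the circuit
    if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
      self.state = "OPEN"
      self.opened_at = time.time()

  def _on_success(self):
    self.state = "CLOSED"
    self.failure_count = 0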


It is important to discuss the purpose of the Half-Open state during the system design interview. The Half-Open state is useful to prevent a recovering service from suddenly being flooded with requests. As a service recovers, it might be able to support a limited volume of requests until the recovery is complete, but while recovery is in progress a flood of work can cause the service to time out or fail again.

The failure counter used by the Closed state is time based. It's automatically reset at periodic intervals. This helps to prevent the circuit breaker from entering the Open state if it experiences occasional failures. The failure threshold that trips the circuit breaker into the Open state is only reached when a specified number of failures have occurred during a specified interval. The counter used by the Half-Open state records the number of successful attempts to invoke the operation. The circuit breaker reverts to the Closed state after a specified number of consecutive operation invocations have been successful. If any invocation fails, the circuit breaker enters the Open state immediately and the success counter will be reset the next time it enters the Half-Open state.

The Circuit Breaker pattern provides stability while the system recovers from a failure and minimizes the impact on performance. It can help to maintain the response time of the system by quickly rejecting a request for an operation that's likely to fail, rather than waiting for the operation to time out, or never return. If the circuit breaker raises an event each time it changes state, this information can be used to monitor the health of the part of the system protected by the circuit breaker, or to alert an administrator when a circuit breaker trips to the Open state.

The pattern is customizable and can be adapted according to the type of the possible failure and the requirements of the system design interview. For example, you can apply an increasing timeout timer to a circuit breaker. You could place the circuit breaker in the Open state for a few seconds initially, and then if the failure hasn't been resolved increase the timeout to a few minutes, and so on. In some cases, rather than the Open state returning failure and raising an exception, it could be useful to return a default value that is meaningful to the application.

Issues and trade-offs to discuss in the system design interview


You should consider the following points when deciding how to approach this in a system design and architecture interview:

  • Exception Handling. An application invoking an operation through a circuit breaker must be prepared to handle the exceptions raised if the operation is unavailable. The way exceptions are handled will be application specific. For example, an application could temporarily degrade its functionality, invoke an alternative operation to try to perform the same task or obtain the same data, or report the exception to the user and ask them to try again later.
  • Types of Exceptions. A request might fail for many reasons, some of which might indicate a more severe type of failure than others. For example, a request might fail because a remote service has crashed and will take several minutes to recover, or because of a timeout due to the service being temporarily overloaded. A circuit breaker might be able to examine the types of exceptions that occur and adjust its strategy depending on the nature of these exceptions. For example, it might require a larger number of timeout exceptions to trip the circuit breaker to the Open state compared to the number of failures due to the service being completely unavailable.
  • Logging. A circuit breaker should log all failed requests (and possibly successful requests) to enable an administrator to monitor the health of the operation.
  • Recoverability. You should configure the circuit breaker to match the likely recovery pattern of the operation it's protecting. For example, if the circuit breaker remains in the Open state for a long period, it could raise exceptions even if the reason for the failure has been resolved. Similarly, a circuit breaker could fluctuate and reduce the response times of applications if it switches from the Open state to the Half-Open state too quickly.
  • Testing Failed Operations. In the Open state, rather than using a timer to determine when to switch to the Half-Open state, a circuit breaker can instead periodically ping the remote service or resource to determine whether it's become available again. This ping could take the form of an attempt to invoke an operation that had previously failed, or it could use a special operation provided by the remote service specifically for testing the health of the service.
  • Manual Override. In a system where the recovery time for a failing operation is extremely variable, it's beneficial to provide a manual reset option that enables an administrator to close a circuit breaker (and reset the failure counter). Similarly, an administrator could force a circuit breaker into the Open state (and restart the timeout timer) if the operation protected by the circuit breaker is temporarily unavailable.
  • Concurrency. The same circuit breaker could be accessed by a large number of concurrent instances of an application. The implementation shouldn't block concurrent requests or add excessive overhead to each call to an operation.
  • Resource Differentiation. Be careful when using a single circuit breaker for one type of resource if there might be multiple underlying independent providers. For example, in a data store that contains multiple shards, one shard might be fully accessible while another is experiencing a temporary issue. If the error responses in these scenarios are merged, an application might try to access some shards even when failure is highly likely, while access to other shards might be blocked even though it's likely to succeed.
  • Accelerated Circuit Breaking. Sometimes a failure response can contain enough information for the circuit breaker to trip immediately and stay tripped for a minimum amount of time. For example, the error response from a shared resource that's overloaded could indicate that an immediate retry isn't recommended and that the application should instead try again in a few minutes.

When to use this pattern in the system design interview

  • Use this system design pattern:
    • To prevent an application from trying to invoke a remote service or access a shared resource if this operation is highly likely to fail.
  • This system design pattern isn't recommended:
    • For handling access to local private resources in an application, such as an in-memory data structure. In this environment, using a circuit breaker would add overhead to your system.
    • As a substitute for handling exceptions in the business logic of your applications.

Discussing the retry and circuit breaker patterns comes up frequently in system design and architecture interviews at Google, Microsoft, Facebook, Airbnb, Uber, Salesforce, LinkedIn and Amazon.

Tuesday, February 11, 2020

The Retry Pattern in System Design and Architecture interviews

The Retry pattern enables an application to handle transient failures when it tries to connect to a service or network resource, by transparently retrying a failed operation. This can improve the stability of the application. Explaining the Retry pattern together with the Circuit Breaker pattern comes up often during the System Design interview. For infrastructure interviews at Google, Facebook, Amazon, etc., being familiar with these patterns is a strong requirement.

Context and problem


An application that communicates with elements running in the cloud has to be sensitive to the transient faults that can occur in this environment. Faults include the momentary loss of network connectivity to components and services, the temporary unavailability of a service, or timeouts that occur when a service is busy.

It is common to be asked to handle such faults during the system design interview.

These faults are typically self-correcting, and if the action that triggered a fault is repeated after a suitable delay it's likely to be successful. For example, a database service that's processing a large number of concurrent requests can implement a throttling strategy that temporarily rejects any further requests until its workload has eased. An application trying to access the database might fail to connect, but if it tries again after a delay it might succeed.

System Design options and tradeoffs


In the cloud, transient faults aren't uncommon and an application should be designed to handle them elegantly and transparently. This minimizes the effects faults can have on the business tasks the application is performing.

If an application detects a failure when it tries to send a request to a remote service, it can handle the failure using the following strategies:


  • Cancel. If the fault indicates that the failure isn't transient or is unlikely to be successful if repeated, the application should cancel the operation and report an exception. For example, an authentication failure caused by providing invalid credentials is not likely to succeed no matter how many times it's attempted.
  • Retry. If the specific fault reported is unusual or rare, it might have been caused by unusual circumstances such as a network packet becoming corrupted while it was being transmitted. In this case, the application could retry the failing request immediately because the same failure is unlikely to be repeated and the request will probably be successful.
  • Retry after delay. If the fault is caused by one of the more commonplace connectivity or busy failures, the network or service might need a short period while the connectivity issues are corrected or the backlog of work is cleared. The application should wait for a suitable time before retrying the request.


For the more common transient failures, the period between retries should be chosen to spread requests from multiple instances of the application as evenly as possible. This reduces the chance of a busy service continuing to be overloaded. If many instances of an application are continually overwhelming a service with retry requests, it'll take the service longer to recover.

If the request still fails, the application can wait and make another attempt. If necessary, this process can be repeated with increasing delays between retry attempts, until some maximum number of requests have been attempted. The delay can be increased incrementally or exponentially, depending on the type of failure and the probability that it'll be corrected during this time.
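
As a concrete reference, here is a minimal sketch of the retry-after-delay strategy with exponential backoff and jitter (the TransientError type and the parameter values are illustrative assumptions):

import random
import time

class TransientError(Exception):
  # hypothetical marker for faults worth retrying (timeouts, throttling)
  pass

def retry_with_backoff(operation, max_attempts=4, base_delay=0.5):
  for attempt in range(max_attempts):
    try:
      return operation()
    except TransientError:
      if attempt == max_attempts - 1:
        raise  # out of attempts: surface the failure to the caller
      # exponential backoff; the random jitter spreads retries from many
      # application instances so they don't hit the service in lockstep
      delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
      time.sleep(delay)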

The following diagram can be used in a system design whiteboard interview to illustrate invoking an operation in a hosted service using this pattern. If the request is unsuccessful after a predefined number of attempts, the application should treat the fault as an exception and handle it accordingly.



The application should wrap all attempts to access a remote service in code that implements a retry policy matching one of the strategies listed above. Requests sent to different services can be subject to different policies. Some vendors provide libraries that implement retry policies, where the application can specify the maximum number of retries, the time between retry attempts, and other parameters.

An application should log the details of faults and failing operations. This information is useful to operators. If a service is frequently unavailable or busy, it's often because the service has exhausted its resources. You can reduce the frequency of these faults by scaling out the service. For example, if a database service is continually overloaded, it might be beneficial to partition the database and spread the load across multiple servers.

Issues and considerations to discuss in the system design interview

You should consider the following points when deciding how to implement this pattern.

The retry policy should be tuned to match the business requirements of the application and the nature of the failure. For some noncritical operations, it's better to fail fast rather than retry several times and impact the throughput of the application. For example, in an interactive web application accessing a remote service, it's better to fail after a smaller number of retries with only a short delay between retry attempts, and display a suitable message to the user (for example, “please try again later”). For a batch application, it might be more appropriate to increase the number of retry attempts with an exponentially increasing delay between attempts.

An aggressive retry policy with minimal delay between attempts, and a large number of retries, could further degrade a busy service that's running close to or at capacity. This retry policy could also affect the responsiveness of the application if it's continually trying to perform a failing operation.

If a request still fails after a significant number of retries, it's better for the application to prevent further requests going to the same resource for a period of time and simply report a failure immediately. When the period expires, the application can tentatively allow one or more requests through to see whether they're successful. We will discuss this strategy, called the Circuit Breaker system design pattern, in a future blog post.

Consider whether the operation is idempotent. If so, it's inherently safe to retry. Otherwise, retries could cause the operation to be executed more than once, with unintended side effects. For example, a service might receive the request, process it successfully, but fail to send a response. At that point, the retry logic might re-send the request, assuming that the first request wasn't received. Operations such as creating an account or processing a payment are typically not idempotent.

A request to a service can fail for a variety of reasons raising different exceptions depending on the nature of the failure. Some exceptions indicate a failure that can be resolved quickly, while others indicate that the failure is longer lasting. It's useful for the retry policy to adjust the time between retry attempts based on the type of the exception.

Consider how retrying an operation that's part of a transaction will affect the overall transaction consistency. Fine tune the retry policy for transactional operations to maximize the chance of success and reduce the need to undo all the transaction steps.

Ensure that all retry code is fully tested against a variety of failure conditions. Check that it doesn't severely impact the performance or reliability of the application, cause excessive load on services and resources, or generate race conditions or bottlenecks.

Implement retry logic only where the full context of a failing operation is understood. For example, if a task that contains a retry policy invokes another task that also contains a retry policy, this extra layer of retries can add long delays to the processing. It might be better to configure the lower-level task to fail fast and report the reason for the failure back to the task that invoked it. This higher-level task can then handle the failure based on its own policy.

It's important to log all connectivity failures that cause a retry so that underlying problems with the application, services, or resources can be identified.

Investigate the faults that are most likely to occur for a service or a resource to discover if they're likely to be long lasting or terminal. If they are, it's better to handle the fault as an exception. The application can report or log the exception, and then try to continue either by invoking an alternative service (if one is available), or by offering degraded functionality. For more information on how to detect and handle long-lasting faults, look up the Circuit Breaker system design pattern.

When to use this in the system design interview

Use this pattern when an application could experience transient faults as it interacts with a remote service or accesses a remote resource. These faults are expected to be short lived, and repeating a request that has previously failed could succeed on a subsequent attempt.

This pattern might not be useful:


  • When a fault is likely to be long lasting, because this can affect the responsiveness of an application. The application might be wasting time and resources trying to repeat a request that's likely to fail.
  • For handling failures that aren't due to transient faults, such as internal exceptions caused by errors in the business logic of an application.
  • As an alternative to addressing scalability issues in a system. If an application experiences frequent busy faults, it's often a sign that the service or resource being accessed should be scaled up.

Discussing the retry and circuit breaker patterns comes up frequently in system design and architecture interviews at Google, Microsoft, Facebook, Twitter, Uber and Amazon.

Sunday, February 9, 2020

Binary Search

In this blog post we will not explain what binary search is or how it works. Instead, we will focus on how to detect a binary search solution in a coding interview question and how to implement it properly.

When to binary search in a coding interview?


We can use binary search almost every time we deal with a monotonic function. A monotonic function is a function between ordered sets that preserves or reverses the original order.

A function f(x) = y that satisfies f(x') <= f(x'') for any x' < x'' is monotonic. We used "<=" as the order relationship, but we can also use ">=" or a custom defined order relationship.

f(x) = 3x+1 is monotonically increasing.
f(x) = -x+5 is monotonically decreasing.
a sorted array arr[] = {2, 5, 9, 10, 12, 17} is a monotonically increasing discrete function!

We should consider using binary search whenever we are given a sorted input, a timeline, or generally speaking, a monotonic function.
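
To illustrate, here is a small sketch of binary searching over a monotonic predicate rather than an array (the function name and the bounds are illustrative):

def binary_search_predicate(lo, hi, predicate):
  # find the smallest x in [lo, hi) for which predicate(x) is True,
  # assuming predicate is monotonic: False for small x, then True
  while lo < hi:
    mid = (lo + hi) // 2
    if predicate(mid):
      hi = mid        # mid works: it may be the answer
    else:
      lo = mid + 1    # mid fails: the answer is strictly greater
  return lo

# usage: first x with f(x) = 3x + 1 >= 100; returns 33
print(binary_search_predicate(0, 10**9, lambda x: 3 * x + 1 >= 100))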



Some coding interview practice questions could look like this:

Given a sorted array of integers A and a random number Q, find out if Q exists in A.
Find the minimum of a given U-shaped function or the peak in a mountain array.

The textbook binary search is flawed


Let's start with the simple problem of finding a certain number Q in the sorted array A. In the programming interview, most candidates will implement something like this:

def find(arr, q):
  left = 0
  right = len(arr)
  while left < right:
    mid = (left + right) // 2
    if arr[mid] == q:
      return True
    if arr[mid] > q:
      right = mid - 1
    else:
      left = mid + 1

  # awkward end-state checks: we don't know whether left == right or
  # left > right, or whether the indices are still within the array bounds
  if right >= 0 and arr[right] == q:
    return True
  if left < len(arr) and arr[left] == q:
    return True
  return False


The problem with this implementation is that at the end of the while loop we don't know precisely if left == right or left > right. We also don't know if we are still within the boundaries of the array. Let's take a few examples:

[left, right] = [3, 4] => mid = 3
we may end up with [left, right] = [3, 2]

[left, right] = [0, 1] => mid = 0
we may end up with [left, right] = [0, -1]

This is where almost every candidate struggles in the programming interview and, in most cases, fails to handle all the possible edge cases.

Lower bound / upper bound


If, instead of asking to find an exact value Q, we now ask for an upper bound or lower bound (nearest value in the array that is >= Q or <= Q) then we end up with even more complex edge cases.

Specifically, when we move the pointers to mid-1 or mid+1, we leave arr[mid] out of the potential solution space. But arr[mid] may still be a viable solution. In our interviewing experience, almost every attempt candidates make to address this results in an infinite loop.

Also, once we are out of the while loop, we won't be able to tell whether arr[left], arr[right], or neither of them is the value we are looking for.

The source of all these corner cases comes from the way we move the pointers:

left = mid + 1
right = mid - 1

Let's try to fix this so that we never leave mid out and we never end up with the confusing end state where left > right.

  • First, keep in mind that whenever we move the pointers, our goal is to generate a smaller interval than the original [left, right]. This is obvious when left and right are far apart, but can be confusing when they get very close.
  • Second, the new binary search interval should be roughly half the size of the previous one (to achieve the logarithmic time complexity).
  • Third, notice that the following relationship stands no matter what:

    left <= mid < right
    or even better:
    left <= mid < mid + 1 <= right

So, if we split the [left, right] interval into [left, mid] and [mid+1, right], then all three conditions above hold, and, according to the third condition, it's guaranteed that left <= right always, and left == right at the end!

Let's implement it:

def find(arr, q):
  if not arr:
    return False
  left = 0
  right = len(arr) - 1
  while left < right:
    mid = (left + right) // 2
    if arr[mid] < q:
      left = mid + 1
    else:
      right = mid

  # at this point we know that left == right
  return arr[left] == q

Short, clean, without edge cases! Easy to adapt when looking for an upper or lower bound. Precisely what we need in a software engineering coding interview.

Of course, we can still check and return early if we find arr[mid] == q. However, this is an optimistic optimization and only provides a real benefit in particularly lucky situations.

def find(arr, q):
  if not arr:
    return False
  left = 0
  right = len(arr) - 1
  while left < right:
    mid = (left + right) // 2
    if arr[mid] == q:
      return True
    if arr[mid] < q:
      left = mid + 1
    else:
      right = mid
  return arr[left] == q

Binary search coding interview practice questions


  • Find first or last occurrence of a given number in a sorted array (see the sketch below)
  • Count occurrences of a number in a sorted array with duplicates
  • Find smallest missing element from a sorted array
  • Search an element in a circular sorted array
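
As a sketch of how the template above adapts to the first question, here is a lower-bound variant that returns the index of the first occurrence of q (the half-open interval [left, right) and the -1 not-found convention are our own choices):

def find_first(arr, q):
  left = 0
  right = len(arr)
  while left < right:
    mid = (left + right) // 2
    if arr[mid] < q:
      left = mid + 1    # everything up to and including mid is too small
    else:
      right = mid       # arr[mid] >= q: mid may still be the answer
  # left is the lower bound: the first index with arr[index] >= q
  if left < len(arr) and arr[left] == q:
    return left
  return -1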

Thursday, January 23, 2020

Gateway Routing system design pattern

The Gateway Routing pattern routes requests to multiple services using a single endpoint. This pattern is useful when you wish to expose multiple services on a single endpoint and route to the appropriate service based on the request. The pattern may come in handy in system design interview problems.

Context and problem


When a client needs to consume multiple services, setting up a separate endpoint for each service and having the client manage each endpoint can be challenging. For example, an e-commerce application might provide services such as search, reviews, cart, checkout, and order history. Each service has a different API that the client must interact with, and the client must know about each endpoint in order to connect to the services. If an API changes, the client must be updated as well. If you refactor a service into two or more separate services, the code must change in both the service and the client.

Despite the differences between them, all these API endpoints have some common requirements like a secure connection, session and/or authentication token management, rate limiting, request tagging and header validations, etc.

Solution


Place a gateway in front of a set of applications, services, or deployments. Use application Layer 7 routing to route the request to the appropriate instances.

With this system design pattern, the client application only needs to know about and communicate with a single endpoint. If a service is consolidated or decomposed, the client does not necessarily require updating. It can continue making requests to the gateway, and only the routing changes.

A gateway also lets you abstract backend services from the clients, allowing you to keep client calls simple while enabling changes in the backend services behind the gateway. Client calls can be routed to whatever service or services need to handle the expected client behavior, allowing you to add, split, and reorganize services behind the gateway without changing the client.

Diagram of the Gateway Routing pattern

This system design pattern can also help with deployment, by allowing you to manage how updates are rolled out to users. When a new version of your service is deployed, it can be deployed in parallel with the existing version. Routing lets you control what version of the service is presented to the clients, giving you the flexibility to use various release strategies, whether incremental, parallel, or complete rollouts of updates. Any issues discovered after the new service is deployed can be quickly reverted by making a configuration change at the gateway, without affecting clients.

Issues and considerations


  • The gateway service may introduce a single point of failure. Ensure it is properly designed to meet your availability requirements. Consider resiliency and fault tolerance capabilities when implementing.
  • The gateway service may introduce a bottleneck. Ensure the gateway has adequate performance to handle load and can easily scale in line with your growth expectations.
  • Perform load testing against the gateway to ensure you don't introduce cascading failures for services.
  • Gateway routing is Layer 7. It can be based on IP, port, header, or URL.


When to use this system design pattern


Use this system design pattern when:
  • A client needs to consume multiple services that can be accessed behind a gateway.
  • You wish to simplify client applications by using a single endpoint.
  • You need to route requests from externally addressable endpoints to internal virtual endpoints, such as exposing ports on a VM to cluster virtual IP addresses.

This system design pattern may not be suitable when you have a simple application that uses only one or two services.


Example


The gateway is commonly implemented as a proxy layer at the same level where the load balancer sits. In Kubernetes, for example, you could use an Ingress controller as the gateway service.

Using Nginx as the router, the following is a simple example configuration file for a server that routes requests for applications residing on different virtual directories to different machines at the back end.

server {
  listen 80;
  server_name domain.com;

  location /app1 {
    proxy_pass http://10.0.3.10:80;
  }

  location /app2 {
    proxy_pass http://10.0.3.20:80;
  }

  location /app3 {
    proxy_pass http://10.0.3.30:80;
  }
}


To better prepare for the system design interview, we recommend reading about industry-proven implementations of the gateway system design pattern at scale.



Fibonacci and Matrix Exponentiation

Problem statement
The Fibonacci numbers, commonly denoted F(n), form a sequence, called the Fibonacci sequence, such that each number is the sum of the two preceding ones, starting from 0 and 1. That is:

F(0) = 0, F(1) = 1
and
F(n) = F(n-1) + F(n-2) for n > 1.

The beginning of the sequence is thus:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...

A common coding interview question is to compute the n-th Fibonacci number.

Naive solution

A naive way of implementing F(n) is to use recursion:

def fibonacci(n):
  if n <= 1:
    return n
  return fibonacci(n - 1) + fibonacci(n - 2)

Such an implementation leads to an exponential runtime. The issue is that we compute the same values over and over again. Let's take a look at the recursion tree:


We can quickly estimate an upper bound for the time complexity as O(2^n). A tight bound of Θ(((1+sqrt(5))/2)^n) = Θ(Fib(n)) can be determined using generating functions and the golden ratio.

Linear solution

We can easily improve the previous solution by using memoization to avoid recomputing the same value multiple times:

fib = {
  0: 0,
  1: 1
}

def fibonacci(n):
  if n in fib:
    return fib[n]

  fib[n] = fibonacci(n - 1) + fibonacci(n - 2)
  return fib[n]

The time complexity is now linear O(n), but we are also using O(n) memory. Let's rewrite the solution iteratively:

def fibonacci(n):
  fib = [0, 1]
  for i in range(2, n + 1):
    fib.append(fib[i - 1] + fib[i - 2])
  return fib[n]

Note how at each step we only need to look at the previous two values. Thus we can achieve the same result with constant O(1) additional memory:

def fibonacci(n):
  if n <= 1:
    return n

  tmp0 = 0
  tmp1 = 1
  for i in range(2, n + 1):
    curr = tmp0 + tmp1
    tmp0 = tmp1
    tmp1 = curr
  return tmp1

Optimal solution

The linear solution is the one described in most coding interview training resources. But can we do better? As it turns out we can actually solve the problem in logarithmic time. Let's see how.

We will rewrite the Fibonacci formula using simple matrix algebra:

[ F(n+1) ]   [ 1  1 ]   [ F(n)   ]
[ F(n)   ] = [ 1  0 ] x [ F(n-1) ]

which is also equivalent to

[ 1  1 ]^n   [ F(n+1)  F(n)   ]
[ 1  0 ]   = [ F(n)    F(n-1) ]
Now all we need to do is compute the matrix exponentiation, which can be implemented in O(logN) time. The implementation is not too difficult either:


def mat_mult(a, b):
  # multiply two 2x2 matrices
  return [[a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
          [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]]]

def power(base_matrix, exponent):
  if exponent == 1:
    return [row[:] for row in base_matrix]

  if exponent % 2 == 1:
    return mat_mult(power(base_matrix, exponent - 1), base_matrix)
  half = power(base_matrix, exponent // 2)
  return mat_mult(half, half)

def fibonacci(n):
  if n == 0:
    return 0
  base_matrix = [[1, 1], [1, 0]]
  # [ [1,1], [1,0] ]^n == [ [F(n+1), F(n)], [F(n), F(n-1)] ]
  result_matrix = power(base_matrix, n)
  return result_matrix[0][1]


Solving linear recurrence problems

A linear recurrence relation is an equation that defines the n-th term in a sequence in terms of the k previous terms in the sequence. The recurrence relation is of the form:

x(n) = c1*x(n-1) + c2*x(n-2) + ... + ck*x(n-k)

where each ci is a constant coefficient.

The Fibonacci problem is a particular case of a 2nd-degree linear recurrence with both coefficients equal to 1. The matrix exponentiation solution can be used to solve any linear recurrence problem. For example, if we had to solve:

x(n) = 6*x(n-1) - 12*x(n-2) + 8*x(n-3)

then we can build the matrix:

[ 6  -12   8 ]   [ x(n-1) ]   [ x(n)   ]
[ 1    0   0 ] x [ x(n-2) ] = [ x(n-1) ]
[ 0    1   0 ]   [ x(n-3) ]   [ x(n-2) ]

For a recurrence of degree k, each matrix multiplication takes O(k^3) time and the exponentiation requires O(logN) multiplications, so we can solve the general problem in O(k^3 * logN) time.
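
To make this concrete, here is a sketch of a general k-th degree solver. The function names, the 0-indexed convention for the initial terms, and the coefficient layout are our own illustrative choices:

def mat_mult(a, b):
  # k x k matrix multiplication
  k = len(a)
  return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(k)]
          for i in range(k)]

def mat_pow(m, e):
  # iterative exponentiation by squaring: O(k^3 * log e)
  k = len(m)
  result = [[1 if i == j else 0 for j in range(k)] for i in range(k)]  # identity
  while e > 0:
    if e % 2 == 1:
      result = mat_mult(result, m)
    m = mat_mult(m, m)
    e //= 2
  return result

def linear_recurrence(c, initial, n):
  # solves x(n) = c[0]*x(n-1) + c[1]*x(n-2) + ... + c[k-1]*x(n-k),
  # given the first k terms x(0)..x(k-1) in `initial`
  k = len(c)
  if n < k:
    return initial[n]
  # companion matrix: the first row holds the coefficients, the sub-diagonal
  # shifts the state vector [x(i), x(i-1), ..., x(i-k+1)] down by one
  m = [c[:]] + [[1 if j == i else 0 for j in range(k)] for i in range(k - 1)]
  p = mat_pow(m, n - k + 1)
  state = initial[::-1]  # [x(k-1), x(k-2), ..., x(0)]
  return sum(p[0][j] * state[j] for j in range(k))

# sanity check: Fibonacci is the k=2 case with both coefficients equal to 1
# linear_recurrence([1, 1], [0, 1], 10) == 55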

Note

The logarithmic solution is only viable when we need to compute a single value for a given n. If, for example, we are asked to compute all the Fibonacci numbers up to and including n, then the linear solution is the obvious right choice.