Understanding "Failed Open" and "Fail Closed" in Software Engineering

"Fail open" and "fail closed" are concepts that come up often when talking with software engineers. Maybe you are a software engineer yourself and one of them came up in code review, or maybe a colleague used the terminology in a brand new context. This is exactly what makes these concepts interesting: they apply outside of code, and they can also describe code in extremely fine detail.

The crux of both concepts is to consider the failure scenario. If we visualize the gate of a castle, we can apply the two concepts.

When something "fails open", it means that, when something unaccounted for occurs, the gates are open and anything can get inside.
When something "fails closed", it means that, when something unaccounted for occurs, the gates remain closed and nothing can get inside.

You may have realized that one of these scenarios is usually preferable to the other. Why in the world would something fail open when it could fail closed? The answer lies in the control flow of programming languages.

Code Examples: Fail-Open vs. Fail-Closed

Let's take a look at some code:

# Example A
if not user.is_allowed():
    raise NotAllowedError()
do_more_work_here()
return response

# Example B
if user.is_allowed():
    do_more_work_here()
    return response
raise NotAllowedError()

Can you tell which of these is fail-open vs fail-closed?

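Example A is fail-open. Example B is fail-closed. In Example A, the path taken when nothing intervenes ends in a response; in Example B, the path taken when nothing intervenes ends in an error.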

The Implication of Unhandled Scenarios

Consider a failure scenario that the is_allowed() check does not account for. What happens when that scenario occurs? In Example B, the default outcome is the raise at the bottom, so a scenario that slips past the check ends in an error. In Example A, the default outcome is the response, so a scenario that slips past the check ultimately returns it!
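
A minimal sketch can make this concrete. It is illustrative only: the role names, the two handler functions, and the "response" string are hypothetical stand-ins rather than anything from the examples above. handle_fail_open has the shape of Example A, handle_fail_closed has the shape of Example B, and both receive a role that neither was written with in mind.

```python
class NotAllowedError(Exception):
    """Raised when a request must be rejected."""


# Hypothetical role names, used only for illustration.
BLOCKED_ROLES = {"banned"}
ALLOWED_ROLES = {"admin", "member"}


def handle_fail_open(role):
    # Shape of Example A: reject the failures we thought of,
    # then fall through to the response.
    if role in BLOCKED_ROLES:
        raise NotAllowedError(role)
    return "response"


def handle_fail_closed(role):
    # Shape of Example B: accept the successes we thought of,
    # then fall through to the error.
    if role in ALLOWED_ROLES:
        return "response"
    raise NotAllowedError(role)


# "suspended" is a role that neither set mentions.
print(handle_fail_open("suspended"))  # prints "response"

try:
    handle_fail_closed("suspended")
except NotAllowedError:
    print("rejected")  # prints "rejected"
```

The two functions agree on every role that was anticipated. They differ only on the role nobody planned for, which is exactly where the fail-open and fail-closed labels matter.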

In an example this small, you might be wondering why anyone would be tempted to write code in the fail-open style, but imagine that you have hundreds of lines of code where do_more_work_here() is called. In fail-closed code, each additional check indents the successful execution path one level further:

if request.is_valid():
    work = do_more_work_here()
    if work.was_successful():
        # Look up the user only after the work has succeeded
        user = lookup_user()
        if user.is_allowed():
            return response
# Every other path ends here
raise NotAllowedError()

You can combine checks like so:

if request.is_valid():
    # Both calls now run for every valid request
    work = do_more_work_here()
    user = lookup_user()
    if work.was_successful() and user.is_allowed():
        return response
raise NotAllowedError()

However, if functions like do_more_work_here() or lookup_user() are expensive or contain side effects, you'd want to call each one only after the earlier checks have passed, for example looking up the user only once the work is known to have succeeded. That brings the nesting back. Fail-closed code can end up awful to read: sometimes so awful that you become more likely to write bugs simply because it is too hard to follow. This is why you must decide for yourself where to accept the risk of fail-open code and where to insist on fail-closed code.
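
Where every check can be reduced to a plain boolean call, one possible middle ground is to lean on the short-circuit behavior of Python's `and` operator, which stops evaluating at the first falsy operand. The sketch below is an assumption-laden simplification: the three check functions take a dictionary and return booleans, and the `calls` list exists only to show which of them actually ran. None of these names come from the examples above.

```python
class NotAllowedError(Exception):
    """Raised when a request must be rejected."""


calls = []  # records which stand-in functions actually ran


def is_valid(request):
    calls.append("is_valid")
    return request.get("valid", False)


def do_more_work_here(request):
    calls.append("do_more_work_here")
    return request.get("work_ok", False)


def user_is_allowed(request):
    calls.append("user_is_allowed")
    return request.get("allowed", False)


def handle(request):
    # `and` stops at the first falsy operand, so the later
    # (possibly expensive) calls are skipped entirely.
    if (
        is_valid(request)
        and do_more_work_here(request)
        and user_is_allowed(request)
    ):
        return "response"
    # Every other outcome falls through to the rejection.
    raise NotAllowedError()


try:
    handle({"valid": False})
except NotAllowedError:
    print(calls)  # prints ['is_valid']
```

This stays flat and fail-closed, but it only works when the intermediate results are not needed later. Once a check has to hand a value such as `work` or `user` to the next step, the nesting tends to come back.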

The next time you're left wondering why a senior developer's code is nested deeply in something critical, consider that they were trying to have the code fail-closed.
