Error classification in multi-component systems

Reproduced below is the set of lecture notes accompanying my September 27, 2022 talk at an internal seminar of Meta's Distributed Artificial Intelligence group, edited for public viewing.

These notes assume that the reader has a nodding acquaintance with operating multi-component software systems in a distributed setting.


1. Introduction

Fundamental to any software engineering project is error classification, which brings clarity to failures and, consequently, enables the system and its operators to handle those failures better. Every reliability monitoring effort needs a decision on which kinds of failures to focus on and who should do something about them. Moreover, efficiency and performance work often involves a balancing act between fault tolerance and resource conservation, which, in turn, requires an understanding of which kinds of failures the fault tolerance scheme in question can and cannot handle.

In these notes, we survey the foundations of error classification in multi-component systems. Section 2 introduces a self-contained example of an error classification project. Section 3 expands on the example to discuss typical implementation concerns surrounding error classification projects. Section 4 takes a step back to present a rudimentary metatheory of error classifications, sketching out two ways to classify error classification schemes.

These notes present the author's personal perspectives on the concepts and methods developed over the past year through carrying out a large-scale error classification project at the author's current place of employment. While the author certainly received immense help and inspirations from many collaborators throughout the project, the views expressed herein are the author's own and do not necessarily represent the views of the author's employer.

2. A First Example

Let us begin with an example of an error classification scheme.

Suppose we run a job on a scheduler that supports retries. The scheduler may deem some errors retryable, and this retry behavior may result in the overall successful completion of the job in question. In other scenarios, the job may encounter an error that will not go away by simply retrying.

A transient error is an error that eventually goes away upon retrying, possibly after multiple attempts. An intransient error is an error that remains no matter how many times the job is retried.

If we disallow retries, then all errors are final errors, i.e., errors that cause the overarching job to fail. By allowing retries, however, it is possible to turn errors into intermediate errors, i.e., errors that do not cause the overarching job to fail. Jobs that have transient errors as intermediate errors may eventually succeed; those that have intransient errors as intermediate errors will never succeed.
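
The retry semantics above can be sketched in a few lines of Python. The error names, job functions, and retry loop here are hypothetical illustrations, not any particular scheduler's API:

```python
class TransientError(Exception):
    """An error that may go away upon retrying."""

class IntransientError(Exception):
    """An error that remains no matter how many times we retry."""

def run_job(job, max_attempts):
    """Retry `job` up to `max_attempts` times; return True on success.

    Errors raised on non-final attempts are intermediate errors; if
    every attempt fails, the job fails with a final error.
    """
    for attempt in range(max_attempts):
        try:
            job(attempt)
            return True
        except (TransientError, IntransientError):
            continue  # intermediate error: try again
    return False  # final error: the overarching job failed

# A job whose transient error clears on the third attempt.
def flaky_job(attempt):
    if attempt < 2:
        raise TransientError("resource temporarily unavailable")

# A job whose error never clears.
def broken_job(attempt):
    raise IntransientError("malformed configuration")
```

With enough attempts, `flaky_job` eventually succeeds, while `broken_job` never does, regardless of the retry budget.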

Suppose for now that we are interested in reducing the number of failed jobs. Retrying transient errors results in fewer failed jobs; it matters less what we do about intransient errors, since retrying them can never produce a successful job. In this scenario, we would want to monitor the intermediate errors and argue for a lenient retry policy: if we cannot determine whether an error is transient or intransient, we should allow retries by default.

Suppose instead that we are interested in conserving compute resources. Retrying intransient errors consumes resources without ever producing a successful job. In this scenario, we would want to monitor the final errors and argue for a strict retry policy: if we cannot determine whether an error is transient or intransient, we should disallow retries by default.

The above two goals seem to be in conflict at first sight. But we can, in fact, bring them closer together by reducing the number of errors that we cannot determine to be transient or intransient. In other words, better attribution of errors helps us achieve both goals.

Concretely, we could mark as many exception definitions in a target codebase as possible as transient or intransient. If a pre-existing exception class is used in both transient and intransient cases, it may be useful to replace it with more granular exceptions that we can easily determine to be one or the other. For errors that we cannot attribute in advance, it may be possible to use information available at runtime to decide on the fly whether an error is transient or intransient.

With this attribution scheme in place, we could have each error provide an instruction to the underlying scheduler as to whether the job should be retried or not: retry transient errors, and do not retry intransient errors. We would still have to choose the default retry policy for errors that we fail to attribute, but good enough attribution will make this choice less costly.
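
As a sketch (with made-up exception names), the attribution could live directly on the exception classes, with a small translation function standing in for the scheduler's retry decision:

```python
class JobError(Exception):
    # None means "unattributed"; the scheduler falls back to its default policy.
    transient = None

class NetworkTimeout(JobError):
    transient = True   # attributed as transient: instruct the scheduler to retry

class InvalidConfig(JobError):
    transient = False  # attributed as intransient: do not retry

def should_retry(error, default=False):
    """Translate an error's attribution into a retry instruction.

    `default` is the retry policy for errors we failed to attribute;
    good enough attribution makes this choice less costly.
    """
    transient = getattr(error, "transient", None)
    if transient is None:
        return default
    return transient
```

Note that errors from outside the hierarchy (for example a bare `ValueError`) also fall through to the default policy, which is exactly the case that better attribution shrinks over time.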

3. How To Execute An Error Classification Project

In the previous section, we mapped out an error classification project. Many such projects follow this recipe:

  1. Come up with definitions for different kinds of errors.
  2. Decide what we could do with such definitions.
  3. Figure out how to attribute the errors.
  4. Use the attributed errors to achieve a goal beyond classifying the errors themselves.

Step 4 is crucially important: as with any engineering project, error classification projects must always have clear impact. This, in particular, means that we must always choose classification schemes in service of a concrete goal.

An excellent way to handle Steps 1 and 3 is to design a comprehensive exception hierarchy and to implement it consistently throughout the codebase. This allows us to implement a translation layer, separate from the business logic, that converts error types into desired behaviors, which keeps the scheme easy to maintain.
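
A minimal sketch of such a hierarchy and translation layer, with invented class and behavior names rather than those of any specific codebase:

```python
# A small exception hierarchy for a hypothetical pipeline.
class PipelineError(Exception): pass
class UserError(PipelineError): pass          # the caller supplied bad input
class ComponentError(PipelineError): pass     # a component malfunctioned
class DependencyError(ComponentError): pass   # an external dependency failed

# The translation layer, kept separate from business logic:
# a list of (error type, behavior) pairs, most specific first.
BEHAVIORS = [
    (UserError, {"retry": False, "alert": False}),
    (DependencyError, {"retry": True, "alert": True}),
    (ComponentError, {"retry": False, "alert": True}),
]

def behavior_for(error):
    for error_type, behavior in BEHAVIORS:
        if isinstance(error, error_type):
            return behavior
    # Unattributed errors default to alerting so they get investigated.
    return {"retry": False, "alert": True}
```

Because the mapping lives in one place, changing a behavior (say, making dependency failures page someone) does not require touching the code that raises the errors.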

In practice, we may find ourselves faced with a large codebase that grew in size over time without a consistent error classification scheme. A good way to gradually improve the state of error classification over time is to enable optional addition of error metadata to exception definitions, similar to gradual typing in dynamically-typed programming languages.

Consistency is crucial here. So long as we stick with one well-thought-out approach, the implementation itself can be as simple as adding an attribute carrying key-value pairs to exceptions.
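
One possible shape for this, sketched with invented decorator and attribute names: a class decorator that optionally attaches a metadata dictionary, so unannotated legacy exceptions keep working while annotated ones gradually gain structure.

```python
def with_error_metadata(**metadata):
    """Optionally attach key-value error metadata to an exception class.

    Analogous to gradual typing: annotation is opt-in, and the rest of
    the codebase is unaffected until a class is annotated.
    """
    def decorate(exc_class):
        exc_class.error_metadata = dict(metadata)
        return exc_class
    return decorate

@with_error_metadata(transient=False, oncall="storage-team")
class ChecksumMismatch(Exception):
    pass

class LegacyError(Exception):  # not yet annotated; still perfectly usable
    pass

def metadata_of(error):
    """Read error metadata, tolerating unannotated exceptions."""
    return getattr(error, "error_metadata", {})
```

Downstream tooling reads metadata through `metadata_of` and treats the empty dictionary as "not yet attributed," so coverage can improve one exception class at a time.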

Note that no error classification project can succeed without adequate consideration of error propagation issues. Failures can arise anywhere in a system. The error metadata we generate from an error classification scheme must be transported to the part of the system that is equipped to handle such failures, whether it be to log and monitor or to change the subsequent behaviors of the system.

The hard truth is that an error propagation toolchain in a multi-component system can break down at any component. The more components an error must pass through, the more likely the toolchain is to malfunction.

The only effective way out of this conundrum is to drastically reduce the number of components an error must go through. We can achieve this by, for example, introducing a centralized logging and monitoring infrastructure and having each component write to it in a standardized fashion.
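
As a sketch of such standardized writes (the record fields and the in-memory log standing in for real shared infrastructure are invented for illustration), each component emits one self-describing record directly, rather than relaying the error through every component on the path to the monitoring system:

```python
import json
import time

# Stand-in for a centralized logging and monitoring infrastructure.
ERROR_LOG = []

def report_error(component, error, metadata=None):
    """Write one standardized error record directly to central infra.

    Every component uses the same record shape, so the propagation
    distance is one hop regardless of where the failure arose.
    """
    record = {
        "ts": time.time(),
        "component": component,
        "type": type(error).__name__,
        "message": str(error),
        "metadata": metadata or {},
    }
    ERROR_LOG.append(json.dumps(record))
    return record
```

The key property is uniformity: because every component writes the same shape, the monitoring side needs exactly one parser, not one per component.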

This is, of course, substantially cheaper to implement for systems whose designers have seriously considered error propagation matters during the design phase. Implementing an error propagation distance reduction scheme in a system that was not designed for it is substantially more expensive, as it entails rewriting both the underlying logging logic and the components that consume error metadata to determine subsequent system behaviors upon failure, e.g., whether to retry a job.

A practical way to gradually improve the state of error propagation is to build an auxiliary error reporting path that substantially reduces the error propagation distance without making changes to the existing error propagation toolchain in the system. As coverage improves, the auxiliary path will provide additional insight into the system health, collecting and displaying some of the error metadata that the existing error propagation toolchain will invariably miss due to its long propagation distance.

Once the new logging path completely covers the system, we can modify the monitoring and alerting infrastructure and the control plane components to consume the error metadata from the auxiliary error reporting path, thereby promoting it from auxiliary to main.

4. Thinking About Error Classification

In the final section of these notes, we introduce two concepts to help streamline error classification projects: scope and temporality. We consider two opposing scopes — local and global error classification — and two opposing temporalities — error classification preprocessing and postprocessing.

All but the tiniest of systems can be decomposed into components, working together to perform the designed duties of the system in question. We assume for simplicity that everything that happens in a system can be attributed to components. With this in mind, every action of a system can be understood from two perspectives: from the perspective of the component in question, and from the perspective of the system as a whole. We call the former the local perspective, and the latter the global perspective.

The locality-globality distinction is useful in error classification as well. To illustrate this, we suppose that we have a simple system consisting of two components.

USER INPUT -> [FRONT END] -> [BACK END] -> OUTPUT

We now suppose that the back-end component produces a failure due to a bad input value. As far as the back-end component is concerned, this is a user error: the bad input came from the outside of the component.

On the other hand, it is possible that the input provided by the user was, in fact, unproblematic, and the failure was caused by the front-end component failing to encode the user input in a format that the back-end component expects. Since the failure was due to the malfunctioning of a system component, this is a system error from the perspective of the entire system.

In other words, this error is locally a user error but globally a system error.

The above distinction matters, for example, when we wish to track the frequency of system errors to monitor the whole system's reliability. For this purpose, we must track system errors from the global perspective. Excluding errors that are user errors only locally will present an unfairly optimistic portrayal of the system's reliability.

Note that the locality-globality distinction is relative. A large system consists of many components, each of which may have several subcomponents. What is local to a system may be global when we consider the relationship between a component and its subcomponents. Consequently, a global error classification scheme for the entire system may be recycled (possibly with minor modifications) to understand the errors arising from one particular component, with the component playing the role of the entire system with respect to its subcomponents.

Let us now discuss when to attribute errors. Broadly speaking, there are only two possible answers: before an exception is raised, or after. Let us call an act of error attribution performed before the exception is raised preprocessing, and one performed after the exception is raised postprocessing.

All acts of preprocessing in error classification eventually reduce to choosing what kinds of exceptions to raise. Examples of what we can tweak in preprocessing include the name of the exception class and its place in an exception hierarchy, the error message, and various additional attributes (such as retryability or oncall information) carried by the exception object in question.

Postprocessing, on the other hand, may take many different forms. Temporally speaking, we can categorize various postprocessing options into two buckets: real-time postprocessing that takes place immediately after an exception is raised, and after-the-fact postprocessing that happens afterwards.

The standard of immediacy is relative, of course. It typically makes sense to categorize error classification postprocessing schemes based on the latency requirements of the different forms of error handling present in a system.

A termination handler is a typical location for real-time postprocessing. Such a handler combines error metadata from preprocessing with runtime information to decide how to report the error in question. This process can involve recommendations on subsequent system behavior (e.g., retryability), so it must complete rapidly.
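
A minimal sketch of such a handler, assuming a hypothetical `error_metadata` attribute carrying the preprocessing attribution:

```python
def termination_handler(error, runtime_info):
    """Real-time postprocessing: combine preprocessing metadata with
    runtime information into a fast reporting and retry decision."""
    metadata = getattr(error, "error_metadata", {})
    transient = metadata.get("transient")
    # Runtime information can refine a static attribution: even a
    # transient error should not be retried once the job's deadline
    # has already been exceeded.
    retry = bool(transient) and not runtime_info.get("deadline_exceeded", False)
    return {
        "error_type": type(error).__name__,
        "transient": transient,
        "retry_recommended": retry,
    }

class QuotaExceeded(Exception):
    error_metadata = {"transient": True}
```

Everything here is a dictionary lookup or an attribute read, which is what keeps the decision fast enough for the scheduler to act on.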

More generally, the control plane of a system can serve as an appropriate location for real-time postprocessing. Upon failure, the control plane must make an error attribution decision in order to figure out what to do next. Once again, this decision must be made rapidly, as the subsequent behaviors of the system depend on it.

In contrast, after-the-fact postprocessing typically takes place in the monitoring components of a system. Since the system does not make its behavior decisions based on after-the-fact postprocessing, it is possible to perform high-latency operations such as log parsing to make finer-grained error attribution decisions.

An oncall triaging tool is another typical location for after-the-fact postprocessing. Such a tool may check if there is an explicit oncall configuration in the error metadata. If not, the tool may employ various heuristics such as looking at the owner of the underlying package to determine the oncall. This process is not instantaneous and may require several service calls. But this is OK, as human monitoring and response are never instantaneous, anyway.
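
A sketch of such a triage fallback chain, with an invented package-ownership table standing in for what would, in practice, be several slow service calls:

```python
# Hypothetical package-ownership table used as a fallback heuristic.
PACKAGE_OWNERS = {
    "storage": "storage-oncall",
    "frontend": "web-oncall",
}

def triage_oncall(error_metadata, package=None, default="catchall-oncall"):
    """After-the-fact postprocessing: determine the oncall for an error.

    Prefer an explicit oncall in the error metadata; otherwise fall
    back to the owner of the underlying package; otherwise route to a
    catch-all rotation. Latency is acceptable here, since human
    monitoring and response are never instantaneous anyway.
    """
    if "oncall" in error_metadata:
        return error_metadata["oncall"]
    if package in PACKAGE_OWNERS:
        return PACKAGE_OWNERS[package]
    return default
```

The ordering encodes a trust hierarchy: explicit attribution beats heuristics, and heuristics beat the catch-all.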

Combining scope and temporality allows us to delineate error classification projects. We could, for example, implement the example in Section 2 as a global error classification project: preprocessing attributes exception definitions as transient or intransient, and a real-time postprocessing scheme then converts those attributions into retry vs. no-retry instructions for the scheduler. To carry out the error propagation distance reduction scheme sketched in Section 3, in contrast, it would make sense to emphasize the local error classification angle first, focusing initially on preprocessing and after-the-fact postprocessing.