An Introduction to Scalability, Reliability, and Maintainability


1. Introduction

In this article, we'll revisit three essential concepts to keep in mind when building backend systems in data-intensive environments.

2. Size and Complexity of Datasets

Data can grow from two perspectives: size and complexity.

  • Data size growth means that the amount of data increases: more bytes of the same data type enter the system.
  • Data complexity growth means that the type of data changes. For example, processing a text file is more straightforward than processing a video file, so video files are more complex than text files.

3. Data-Intensive vs. Compute-Intensive

There are two kinds of environments where computers are predominantly needed:

  • Data-Intensive, when the input data grows in size and complexity.
  • Compute-Intensive, when the cost of computing a single operation on the same input data grows.

The typical issue nowadays is building systems that execute simple operations, for example, passing data from a web page to the database, yet continue to operate efficiently as the input data grows in size and complexity. This article will focus on data-intensive systems because they represent the most common issue in the software industry.

4. Reliability, Scalability, and Maintainability

We want to achieve three goals when building data-intensive systems: reliability, scalability, and maintainability.

  • Reliable systems should continue to work correctly even in the face of adversity.
  • Scalable systems smoothly handle the input data's growth in size and complexity.
  • Maintainable systems allow the different people who work on them over time to do so productively.

Those are simplified definitions of our goals. In the following sections, we'll take a closer look at each.

5. Reliability

Building a reliable system is essential to any business. From critical ones like air traffic control, the stock market, power plants, and banks to non-critical ones like e-commerce and video streaming, the system's reliability impacts the customer.

Usually, we divide a system into many error-sensitive parts. It's impossible to reduce the probability of an error in any single part to zero. Thus, the fundamental idea is to build a reliable system out of unreliable parts.

5.1. Types of System Failures

There are typically three sorts of errors that might cause a system part to fail:

  • Hardware errors: Hardware is unreliable, especially storage disks, which fail frequently in large data centers (check out Backblaze's disk failure statistics for 2021 Q1).

  • Software errors: Unexpected bugs can arise from code for many reasons, like false assumptions about the use cases and lack of testing. Bugs in one part of the system can induce bugs in other parts, leading to a cascading failure scenario.

  • Human errors: Humans are more unreliable than machines. Statistics suggest that 75% of business outages happen due to wrong system configuration. See, for example, the case of an outage at Meta.

5.2. Practices to Mitigate System Failures

It's impossible to reduce the chance of a system failure of any kind to zero, but there are some ways to mitigate its probability:

  • Set up a test environment similar to production to serve as a sandbox, and test the integration of the system's parts thoroughly to identify potential edge cases.
  • Configure fast and easy production release rollbacks.
  • Make APIs that restrict humans from doing wrong actions and encourage them to do the correct actions.
  • Isolate the system's parts as much as possible to make them loosely coupled.
  • Extensively use observability tools to monitor systems behavior.
  • Create hardware replicas to take care of a particular task if other hardware fails.
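The last practice, hardware replicas, can be sketched as a simple failover read. This is a minimal illustration, not a production pattern; the `failing_primary` and `healthy_replica` functions are hypothetical stand-ins for real storage backends:

```python
def read_with_failover(primary, replica):
    """Try the primary first; fall back to the replica if it fails."""
    try:
        return primary()
    except Exception:
        # In a real system, we'd also log the failure and alert on it.
        return replica()

def failing_primary():
    raise IOError("disk failure")

def healthy_replica():
    return "data-from-replica"

# The primary raises, so the replica answers the read.
value = read_with_failover(failing_primary, healthy_replica)
print(value)  # data-from-replica
```

Real replication involves keeping the replica's data in sync with the primary, which is a much harder problem than the fallback logic shown here.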

The idea behind building a reliable system is not to avoid all failures, but to exercise them and monitor how the system reacts to these events. One good example of failure exercising is the Netflix Chaos Automation Platform (ChAP).

5.3. Thoughts on Reliability

Making systems reliable generates costs, since building hardware replicas, hiring specialized people, and setting up monitoring systems are not free. Businesses should balance a system's reliability against its expenses by measuring its importance. If the system doesn't have to be continuously available, we should reevaluate how much to invest in reliability.

Companies that are starting their business should think twice before investing in reliability. Usually, startups should care about other things like marketing, designing and prototyping the product, and capturing investors. Thus, investing in reliability at this point might make the company's process slower and more expensive.

6. Scalability

Scalability is the ability of a system to keep working smoothly when input data grows in size and complexity. Two essential concepts regarding scalable systems are load and performance.

6.1. Load and Performance

  • Load parameters describe and quantify the load. Some frequently used load parameters are reads/writes per second in databases, requests per minute, and the number of active users.

  • Performance metrics tell how the system behaves when the load parameters increase. Frequently used performance metrics are latency, throughput, and response time.

Performance metrics are crucial to confirm if a system can handle the load. The three most used performance metrics are:

  • Throughput is the number of requests that a system processes per unit of time.
  • Latency is how long a request takes to reach the system.
  • Response time is how much time the system needs to receive and process one request.

Latency and response time are similar except for one detail: while latency measures only the request's delivery time, response time measures the delivery time plus the processing time.

Factors like network, software, and hardware failures might slow requests down. Thus, we should measure over a longer period to evaluate system performance accurately.

We can represent performance metrics as a distribution of values to create trustworthy statistics about our systems. One statistical tool is very effective when working with distributions: the percentile. In short, a percentile tells us that at least X percent of the observed values are smaller than some threshold Y. That threshold Y is also known as the Xth percentile, or pX. Let's work through an example to be more precise:

Suppose we have a non-functional requirement that at least 95% of an API's requests take no more than 500 ms to process. In that case, 500 ms is our target 95th percentile, or p95. After building the API, we observed the following response times:

instant  0    1    2    3    4    5    6    7    8    9    10   11   12   13
ms       355  389  424  450  822  730  481  492  393  386  381  351  326  415

Two of the 14 observed requests (822 ms and 730 ms) took more than 500 ms to process, so only about 86% of requests stayed under 500 ms. Therefore, we infer that this API doesn't meet the requirement of having at least 95% of requests take less than 500 ms to process. Regarding response time, 500 ms is roughly this API's p86, not its p95.

These statistics use a small sample of 14 observations. In real cases, we should gather many more observations to create adequate statistics.
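We can compute percentiles for the sample above with the nearest-rank method, one of several common conventions (interpolating methods give slightly different values):

```python
import math

# The response times observed in the example above, in milliseconds.
response_times_ms = [355, 389, 424, 450, 822, 730, 481, 492,
                     393, 386, 381, 351, 326, 415]

def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p% of the observations are at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

p95 = percentile(response_times_ms, 95)                        # → 822
over_threshold = sum(1 for t in response_times_ms if t > 500)  # → 2
```

With this sample, p95 is 822 ms, far above the 500 ms target, which confirms the API fails the requirement.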

6.2. Scaling a Data-Intensive System

After evaluating the system's performance under that load, it's time to scale it. Usually, we scale systems from two perspectives, vertically or horizontally:

  • Vertical scaling means adding more resources, like RAM, CPU, network, and disk, to a single machine.
  • Horizontal scaling means adding more machines of the same size to an existing pool of machines.

These two techniques are not exclusive and can work well together. Typically, vertical scaling is the best option for early-stage systems because it avoids distributed-systems issues like keeping data consistent across machines. However, vertical scaling is only practical up to a point; when it's no longer enough to handle the current load, consider adopting horizontal scaling.
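A common building block of horizontal scaling is a load balancer that spreads incoming requests across the pool of machines. Here's a minimal round-robin sketch; the server names are made up for illustration:

```python
import itertools

class RoundRobinBalancer:
    """Cycle through a pool of servers, one request at a time."""

    def __init__(self, servers):
        self._pool = itertools.cycle(servers)

    def next_server(self):
        # Return the server that should handle the next request.
        return next(self._pool)

balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
targets = [balancer.next_server() for _ in range(4)]
# → ["app-1", "app-2", "app-3", "app-1"]
```

Production load balancers also track server health and skip failed machines, which ties horizontal scaling back to the reliability practices discussed earlier.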

Scaling systems, either horizontally or vertically, can be done in two ways:

  • Manual scaling means that a human should analyze the load and performance metrics and add more resources if needed.
  • Elastic scaling means that the system automatically adds or removes resources as the load changes.

Modern systems are becoming big enough to make manual scaling impractical for two reasons:

  1. As previously mentioned, humans are unreliable. Thus, they might introduce bugs in an attempt to scale a system.
  2. Changing machine resource configurations can become tedious and complex. Engineers should be doing more creative and higher-value activities.

For those reasons, companies are adopting Elastic Computing. In Elastic Computing, a human sets the initial values and limits for compute resources, and the cloud provider increases or decreases the resources as the load changes. Cloud providers like Amazon Web Services scale machines automatically, most of the time better than humans would.
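At its core, the elastic-scaling decision is a simple rule: compare a load metric against a target utilization and adjust the machine count within the human-configured limits. The sketch below illustrates the idea; the thresholds and numbers are illustrative and don't reflect any specific provider's API:

```python
import math

def desired_replicas(current, cpu_utilization, target=0.6,
                     minimum=2, maximum=10):
    """Scale the pool so average CPU utilization approaches the target,
    clamped to the limits a human configured."""
    wanted = math.ceil(current * cpu_utilization / target)
    return max(minimum, min(maximum, wanted))

desired_replicas(4, 0.9)  # → 6  (overloaded: scale out)
desired_replicas(4, 0.3)  # → 2  (underused: scale in, bounded by the minimum)
```

Real autoscalers add safeguards such as cooldown periods, so the pool doesn't oscillate on every short spike in load.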

6.3. Thoughts On Scalability

Ultimately, we can follow a few steps to build scalable systems:

  1. Understand the architecture you're designing and make assumptions about the load parameters.
  2. Have a way to measure the system's performance under those load parameters.
  3. Scale vertically and/or horizontally based on the observed performance.

It is essential to mention that assumptions can be wrong. We might misconceive the entire architecture and load parameters in the systems design phase. Making false assumptions is counterproductive and time-wasting.

Just like reliability, scalability has costs. So, it's not worth considering it at the beginning of a business. Building scalable systems should be something that comes from necessity as the business grows.

7. Maintainability

Different people will work on the same code at most companies over time. These people should be able to work on that system productively, or in other words, maintain it. Maintainability is the capacity of a system to stay simple to maintain as time passes.

We can define maintainability as the sum of the Operability, Simplicity, and Evolvability pillars.

7.1. Operability

This pillar means that the system is easy to run from the operations point of view, for example, how simple the system is to deploy and operate on a given infrastructure. Operability is vital to simplify operations tasks and let the operations team focus on higher-value activities.

There are some ways to achieve Operability in your system:

  • Provide default behaviors. For instance, if the system cannot complete a request, is there a good default behavior to use instead of returning an error?
  • Implement self-healing mechanisms. For example, if the application goes down, what should the server do to bring it back up automatically?
  • Exhibit predictable behaviors through observability tools such as logs and events.
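The default-behavior idea can be sketched as a small fallback wrapper. This is only an illustration; the recommendation service and the fallback list are hypothetical:

```python
def get_recommendations(user_id, fetch, fallback=("top-sellers",)):
    """Return personalized recommendations, or a safe default
    instead of an error when the backing service is unavailable."""
    try:
        return fetch(user_id)
    except Exception:
        # Log the failure for the operations team, then degrade gracefully.
        return list(fallback)

def broken_service(user_id):
    raise ConnectionError("recommendation service is down")

print(get_recommendations(42, broken_service))  # ['top-sellers']
```

The customer still sees something useful while the operations team investigates, instead of an error page.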

7.2. Simplicity

Simplicity means that the code is easy to understand. In practice, this means keeping things as straightforward as possible and avoiding unnecessary complexity.

Some signals of software complexity are:

  • Special cases everywhere as a workaround to solve bugs.
  • Dependency issues as a form of Dependency Hell.
  • Inconsistent naming of classes, variables, and methods.
  • Unnecessary design patterns or algorithms to solve a problem, as one form of overengineering.

Engineers should watch for these signals of complexity and try to avoid them. There's no easy way to do that, but some agile coding principles help to write simpler code, like YAGNI, KISS, and DRY.

Creating and using software abstractions can also help to hide implementation details. Abstractions wrap potentially complex algorithms or patterns, letting the developer focus only on the calling code.

One of the best examples of abstraction is a programming language. Programming languages abstract complex operations like disk I/O, RAM management, and compilation. Usually, they act as a Façade over complex algorithms and programming paradigms. Take SQL, for instance: as a high-level language, SQL abstracts complex algorithms and data structures such as B-trees, concurrency control, and cache management.
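In application code, the same idea often appears as a small Façade that hides a multi-step operation behind one call. A minimal sketch; the reporting pipeline here is invented for illustration:

```python
class ReportFacade:
    """Hide the aggregate/format pipeline behind a single method,
    so callers never touch the intermediate steps."""

    def monthly_report(self, rows):
        totals = self._aggregate(rows)
        return self._format(totals)

    def _aggregate(self, rows):
        totals = {}
        for category, amount in rows:
            totals[category] = totals.get(category, 0) + amount
        return totals

    def _format(self, totals):
        return [f"{name}: {value}" for name, value in sorted(totals.items())]

report = ReportFacade().monthly_report(
    [("books", 10), ("games", 5), ("books", 7)]
)
# → ["books: 17", "games: 5"]
```

The caller only needs `monthly_report`; the aggregation and formatting details can change freely behind the Façade.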

7.3. Evolvability

Evolvability means keeping the code open to change so that it can easily adapt to new use cases in the future.

It is unlikely that the system will stay the same as time passes. New use cases appear, wrong assumptions must be corrected, and incompatibilities with other software show up. Thus, most code is written in the maintenance phase, making it the most expensive one.

There's no easy way to predict if a piece of code will change in the future. The rule of thumb is to write code that is open to new use cases, easy to change, and clear enough for future developers.

Good planning also helps to make software evolvable. Agile methodologies intentionally use frameworks that make it easy for the system to adapt to change. In agile environments, we incrementally add small chunks of code to build new features, which makes it easier to plan the whole picture and evolve those chunks into a system.

We can use a framework such as Scrum along with a few agile practices that help to write evolvable code, like TDD and refactoring.

8. Conclusion

In this article, we revisited three non-functional requirements to build productive and valuable software.

Building systems is challenging, especially in data-intensive environments. There are a lot of variables that affect the system's overall performance. We must continually evaluate the context and create a system design that solves a particular problem for the company instead of trying to achieve the three pillars simultaneously.

Some patterns keep appearing to achieve those three pillars' goals. These patterns help us to design systems. In the following posts of this series, we'll look deeper at some of these patterns and which problems they solve.