Robust and Scalable Applications

This article draws on Chapter 1 of "Designing Data-Intensive Applications" to explore why many modern applications are limited less by raw computational power than by their data: how much of it there is, how complex it is, and how quickly it changes. Data-intensive applications are typically built from standard building blocks that provide commonly needed functionality, such as:

Databases: Store data so that it can be found again later
Caches: Remember the result of an expensive operation to speed up reads
Search Indexes: Allow users to search data by keyword or filter it in various ways
Batch Processing: Periodically crunch a large amount of accumulated data

Questions to Ask When Designing Data Systems

When building an application, we have to decide which tools and approaches suit the task at hand. This has become harder in recent years for two reasons: first, many new tools for data storage and processing have emerged; second, many applications now have requirements so demanding that a single tool can no longer meet all of their data processing and storage needs. Instead, the work is broken down into tasks that can each be performed efficiently by a single tool, and those tools are stitched together using application code.

When designing a data system, many questions arise, such as:

How do you ensure the data remains correct and complete, even when things go wrong internally?
How do you consistently provide good performance to clients, even when parts of your system are degraded?
How do you scale to handle an increase in load?
What does a good API for the service look like?

Factors that Influence the Design of a Data System

Factors that influence the design of a data system are:

skills and experience of people involved
legacy system dependencies
time-scale for delivery
your organization's tolerance of different kinds of risk
regulatory constraints

The Three Most Important Concerns for Data Systems

Reliability

The system should continue to work correctly even when it encounters hardware faults, software faults, or human error.

Scalability

As the system grows in data volume, traffic volume, or complexity, there should be reasonable ways of dealing with that growth.

Maintainability

Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.

Now let's discuss these topics in detail.

Reliability

Reliability roughly means "continuing to work correctly, even when things go wrong". Typical expectations of a reliable system include:

The application performs the function that the user expected.
It can tolerate the user making mistakes or using the software in unexpected ways.
Its performance is good enough for the required use case, under the expected load and data volume.
The system prevents any unauthorized access and abuse.

Faults

The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. A fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.

Hardware faults are one type of fault. They include:

Hard disks crash
RAM becomes faulty
The power grid has a blackout
Someone unplugs the wrong cable

We can solve most of these problems by adding redundancy to the individual hardware components: when one component dies, a redundant one can take its place, which reduces the failure rate of the system as a whole.
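To get a feel for why redundancy helps, here is a minimal sketch of the arithmetic. It assumes that components fail independently of each other, which is a simplification, and the function name and the 2% failure probability are made up for illustration rather than taken from the book.

```python
def all_fail_probability(p_single: float, replicas: int) -> float:
    """Probability that every one of `replicas` components fails.

    Assumes failures are independent, which real hardware only
    approximates (a power outage can take out a whole rack at once).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return p_single ** replicas


# Hypothetical numbers: each disk has a 2% chance of failing in some period.
for n in (1, 2, 3):
    print(n, "replica(s):", all_fail_probability(0.02, n))
```

Under the independence assumption each extra replica multiplies the chance of losing everything by the single-component failure probability, which is why even one spare makes a large difference.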

Software errors are another type of fault. Unlike hardware faults, which tend to appear random and independent of each other, software errors are often systematic, which makes them harder to anticipate. Some examples include:

A software bug that causes every instance of an application server to crash when given a particular bad input. For example, consider the leap second on June 30, 2012, that caused many applications to hang simultaneously due to a bug in the Linux kernel.
A runaway process that uses up some shared resource (CPU time, memory, disk space, or network bandwidth).
A service that the system depends on slows down, becomes unresponsive, or starts returning corrupted responses.
Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults.

Human errors are a leading cause of outages: one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults played a role in only 10-25% of them. So how can we build reliable systems when they are designed, built, and operated by humans, who are known to be unreliable? Below are some of the approaches.

Design systems in a way that minimizes the opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do "the right thing" and discourage "the wrong thing".
Decouple the places where people make the most mistakes from the places where they can cause failures.
Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests.
Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).
Implement good management practices and training.
Set up clear and detailed monitoring, such as performance metrics and error rates.
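One of the approaches above, rolling out new code gradually, can be sketched as follows. This is only one possible way to do it, not something the book prescribes: it hashes a user ID into one of 100 buckets so that the same user always gets the same answer. The function name and the user IDs are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user is among the first `percent`% to get new code.

    The decision is deterministic: hashing the user ID always yields the
    same bucket, so a user does not flip between old and new behavior.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


# Hypothetical user IDs: count how many fall into a 10% rollout.
users = [f"user-{i}" for i in range(1000)]
print(sum(in_rollout(u, 10) for u in users), "of", len(users), "users enabled")
```

Because the bucket for a user never changes, raising the percentage only adds users to the rollout, and setting it back to 0 acts as a quick rollback.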

How Important Is Reliability?

Reliability is essential not only in critical systems like nuclear power stations but also in everyday applications, as bugs in business software can lead to lost productivity, legal risks, and reputational damage. Even in noncritical applications, users rely on us to protect important data, such as a parent storing family memories in a photo app. While there may be cases where we prioritize cost savings over reliability, such as in prototype development or low-margin services, it's crucial to be aware when cutting corners and understand the potential impact on users.

Scalability

Scalability refers to a system's ability to handle increased load, such as more users or data. It's not a fixed trait, but a question of how the system can adapt and what options are available to manage growth by adding resources.

Describing Load

Load refers to the current demand on a system, typically measured by key parameters like requests per second, read/write ratios, active users, or cache hit rates. The specific load metrics depend on the system's architecture and can vary based on average or extreme usage cases.

For example: In 2012, Twitter's two key operations were posting tweets (handling an average of 4.6k requests per second) and viewing the home timeline (handling 300k requests per second). Initially, Twitter used a relational database approach for fetching tweets, where posting a tweet inserted it into a global pool, and reading the home timeline involved querying and merging tweets from followed users. However, as the load grew, this method struggled to keep up with home timeline reads, because every read had to look up and merge the tweets of all the accounts a user follows. The scaling challenge comes from this fan-out: each user follows many accounts and is followed by many accounts, and timeline reads are far more frequent than tweet posts.

To address this, Twitter switched to an approach where each user’s home timeline was cached in advance. When a user posted a tweet, it was immediately distributed to all their followers’ timelines. While this reduced the load during read operations, it increased the write load significantly. For example, with an average of 75 followers per user, 4.6k tweets per second transformed into 345k writes per second. For users with millions of followers, this required even more resources to deliver tweets quickly. To manage scalability challenges, Twitter adopted a hybrid model, where most tweets are pre-distributed, but tweets from celebrities with many followers are fetched only when a user views their timeline, balancing performance across different user types.
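The hybrid model described above can be sketched as a toy in-memory service. This is an illustration under simplifying assumptions, not Twitter's actual implementation: the class, its method names, and the celebrity threshold are all hypothetical, and everything lives in Python dictionaries rather than a database or cache cluster.

```python
from collections import defaultdict
from itertools import count


class TimelineService:
    """Toy model of the hybrid home-timeline approach (hypothetical)."""

    def __init__(self, celebrity_threshold):
        # Authors with at least this many followers are merged at read time.
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.tweets_by_author = defaultdict(list)
        self.cached_timeline = defaultdict(dict)  # user -> {tweet_id: tweet}
        self._ids = count(1)

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.celebrity_threshold

    def post(self, author, text):
        """Store a tweet; return how many timeline-cache writes it caused."""
        tweet = (next(self._ids), author, text)
        self.tweets_by_author[author].append(tweet)
        if self.is_celebrity(author):
            return 0  # no fan-out: followers fetch this tweet at read time
        for user in self.followers[author]:
            self.cached_timeline[user][tweet[0]] = tweet
        return len(self.followers[author])

    def home_timeline(self, user):
        """Cached tweets plus celebrity tweets merged at read time, newest first."""
        merged = dict(self.cached_timeline[user])  # keyed by id, so no duplicates
        for author in self.following[user]:
            if self.is_celebrity(author):
                for tweet in self.tweets_by_author[author]:
                    merged[tweet[0]] = tweet
        return sorted(merged.values(), reverse=True)


def fanout_writes_per_second(tweets_per_second, avg_followers):
    """Write load when every tweet is copied to each follower's timeline."""
    return tweets_per_second * avg_followers


# The figures quoted in the text: 4.6k tweets/sec with 75 followers on average.
print(fanout_writes_per_second(4600, 75))  # 345000
```

The sketch makes the trade-off visible: `post` does work proportional to the number of followers for ordinary users, while `home_timeline` only does extra work for the few celebrity accounts a user follows.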

Describing Performance

When investigating load increases on a system, two key questions arise:

Impact of Increased Load: How does system performance change with increasing load when resources (CPU, memory, etc.) remain the same?
Resource Needs to Maintain Performance: How much do resources need to be increased to maintain performance as load increases?

In batch processing systems like Hadoop, performance is typically measured by throughput (e.g., records processed per second). In online systems, the key metric is response time (the time from request to response).

Latency vs. Response Time:
- Latency: The time a request waits before being processed.
- Response Time: The total time for a request to be processed, including service time, network delays, and queuing delays.

Response times vary, so metrics are often presented as distributions. The median (50th percentile) shows the time within which half of the requests are processed. Higher percentiles (e.g., 95th, 99th) measure the slowest requests, which significantly impact user experience.
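As a small illustration of percentiles, the sketch below computes them with the nearest-rank method, which is one of several common definitions. The function name and the sample response times are made up for the example.

```python
import math


def percentile(sorted_times, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_times:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    rank = math.ceil(p * len(sorted_times) / 100)
    return sorted_times[rank - 1]


# Hypothetical response times in milliseconds, with one slow outlier.
times = sorted([30, 42, 45, 51, 58, 60, 75, 90, 120, 1500])

print("mean:", sum(times) / len(times))
print("median (p50):", percentile(times, 50))
print("p95:", percentile(times, 95))
```

With these made-up samples the single slow request pulls the mean far above what a typical user experiences, while the median stays close to the typical case and the high percentile exposes the outlier. That is the reason percentiles are usually preferred over averages for response times.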

Amazon Example: Amazon targets the 99.9th percentile for response time, focusing on the slowest 1 in 1,000 requests. This is crucial because slow requests often come from valuable customers with extensive purchase histories. A 100 millisecond increase in response time can reduce sales by 1%, and a 1-second delay can decrease customer satisfaction by 16%. Although optimizing for the 99.99th percentile can be prohibitively expensive, ensuring fast response times for the 99.9th percentile helps maintain customer satisfaction and sales.

Approaches for Coping with Load

To manage increasing load while maintaining performance, architectures must adapt as load scales up. A system designed for a small load might not handle a tenfold increase effectively, so frequent reevaluation and redesign might be necessary as load grows.

Scaling Strategies:

Vertical Scaling (Scaling Up): Involves upgrading to more powerful machines. While simpler, this can become expensive and impractical for very high loads.
Horizontal Scaling (Scaling Out): Distributes the load across multiple machines. Known as a shared-nothing architecture, this approach can be more cost-effective for intensive workloads, even though it adds complexity.

Elastic vs. Manual Scaling:

Elastic Systems: Automatically adjust resources based on load changes. Useful for unpredictable workloads.
Manually Scaled Systems: Require human intervention to add resources. Simpler but potentially less responsive to sudden changes.
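The decision an elastic system makes can be sketched as a simple calculation. This is a deliberately naive illustration, not how any particular autoscaler works: the function, its parameters, and the capacity numbers are all hypothetical.

```python
import math


def desired_machines(requests_per_second, capacity_per_machine,
                     target_utilization=0.5, min_machines=1, max_machines=100):
    """How many machines to run so that each stays near the target utilization.

    capacity_per_machine is the request rate one machine can sustain at 100%
    utilization; the result is clamped to [min_machines, max_machines].
    """
    if capacity_per_machine <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity and target utilization must be positive")
    usable_capacity = capacity_per_machine * target_utilization
    needed = math.ceil(requests_per_second / usable_capacity)
    return max(min_machines, min(max_machines, needed))


# Hypothetical: each machine handles 1,000 requests/sec at full utilization.
for load in (0, 4600, 300000):
    print(load, "req/s ->", desired_machines(load, 1000), "machines")
```

An elastic system would re-run a calculation like this automatically as load changes, whereas in a manually scaled system a person looks at the same numbers and decides when to add machines.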

Stateful vs. Stateless Services:

Distributing stateless services is relatively straightforward, but scaling stateful data systems introduces significant complexity. Traditionally, databases were kept on a single node until high availability or cost required distribution.

Future Trends:

Advances in distributed system tools may shift the norm toward distributed databases even for smaller workloads. The book will explore various distributed systems' scalability, ease of use, and maintainability.

Application-Specific Architectures:

There is no one-size-fits-all scalable architecture. Effective scaling is built on understanding the specific application's load parameters, such as request volume and size, which vary widely across applications. Incorrect assumptions about these parameters can lead to ineffective scaling efforts.

General Building Blocks:

Scalable architectures often use general-purpose components arranged in familiar patterns.

Maintainability

The majority of software costs come from ongoing maintenance, not initial development. Many find maintaining legacy systems challenging due to outdated technology or unresolved issues. To minimize maintenance pain and avoid creating difficult legacy systems, design principles should focus on:

Operability: Ease for operations teams to manage the system.
Simplicity: Simplify the system to make it understandable for new engineers.
Evolvability: Facilitate future modifications to adapt to changing requirements.

These principles help ensure that software remains manageable and adaptable over time.

Operability: Making Life Easy for Operations

Effective software operations are crucial for maintaining system reliability, as good operations can often mitigate the shortcomings of incomplete software, whereas poor operations can hinder even well-designed software. Operations teams play a key role in system stability by monitoring health, troubleshooting issues, managing updates, ensuring compatibility, planning for future needs, establishing best practices, handling complex tasks, maintaining security, and preserving organizational knowledge. To support these tasks, software should be designed for good operability by offering clear monitoring, automation support, minimal dependencies, comprehensive documentation, flexible defaults, self-healing features, and predictable behavior.

Simplicity: Managing Complexity

As software projects grow, they often become increasingly complex, which hinders understanding and maintenance and leads to higher costs and a greater risk of bugs. A project mired in this kind of complexity is sometimes described as a "big ball of mud"; the symptoms include tangled dependencies and inconsistent naming. Reducing complexity improves maintainability, and one of the best ways to do so is to remove accidental complexity, that is, complexity that is not inherent in the problem but arises only from the implementation. Abstractions are crucial tools for managing complexity: they hide implementation details and allow for reusable, high-quality components. Effective abstractions streamline development, though finding and creating good abstractions, especially in distributed systems, can be challenging.

Evolvability: Making Change Easy

System requirements are likely to change due to new facts, emerging use cases, shifting business priorities, user demands, platform updates, and regulatory changes. Agile methodologies help manage these changes with techniques like test-driven development (TDD) and refactoring, though most discussions of these techniques focus on smaller scales. The book explores how to increase agility at the level of a larger data system, considering how to adapt and evolve complex architectures. The concept of evolvability (the ease of modifying a system to meet changing needs) is central to this discussion, and it is closely linked to the system's simplicity and its abstractions.

Conclusion

This article delves into the foundational aspects of designing data-intensive applications, based on insights from "Designing Data-Intensive Applications". Key topics include identifying the building blocks of data systems such as databases, caches, and search indexes, and considering crucial factors like reliability, scalability, and maintainability. Reliability ensures systems work correctly under faults, scalability addresses handling increased load through various methods, and maintainability focuses on making systems manageable over time with principles like operability, simplicity, and evolvability. The article explores how these elements contribute to robust, efficient, and adaptable data systems.
