Analyzing Redis Source Code: Simple Dynamic Strings (SDS) – An Efficient and Flexible String Implementation
Redis’s Requirements for Strings
Strings are an essential data type in app development, frequently used to store a variety of information such as user profiles, product descriptions, and system messages. In Redis, a popular in-memory data store, both the keys and often the values in key-value pairs are represented as strings. This makes strings one of the foundational data types in Redis, contributing to its flexibility and simplicity when handling data.
For instance, when saving user data such as name, gender, and city, you can use a simple Redis command like:
1 | SET user:id:100 {"name": "John", "gender": "M", "city": "New York"} |
This example stores a stringified JSON object as the value, with a key that identifies the user.
Strings are also crucial for communication in Redis. The commands exchanged between Redis instances and clients, as well as much of the internal data, are transmitted as strings. Given the widespread use of strings in Redis, the way strings are implemented plays a critical role in system performance and efficiency. To optimize this, the implementation should satisfy the following three key requirements:
1. Support for a Wide Range of Efficient String Operations
String operations like appending, copying, comparing, and retrieving the length of a string are extremely common in app development. In Redis, strings are implemented in such a way that these operations are highly efficient. For example, appending data to a string can be done using the APPEND
command:
1 | APPEND user:name "Doe" |
This adds “Doe” to the existing string value stored at the user:name
key. Similar commands exist for other operations like comparing (STRLEN
) or copying strings. Efficient handling of these operations ensures that Redis remains performant, even as the volume of data grows.
2. Ability to Handle Arbitrary Binary Data
Strings in Redis are not limited to human-readable text. They can also store binary data such as images, audio files, or other types of raw data. This flexibility is crucial in modern app development, where applications frequently need to handle multimedia or complex data formats. Redis strings can store up to 512MB of binary data, allowing developers to use them for a variety of tasks beyond simple text storage.
For example, storing an image file in Redis might involve converting the image to binary and saving it as a string:
1 | SET image:profile:100 <binary_data> |
Redis doesn’t distinguish between text and binary data at the storage level, so it treats everything as a string, making the system both simple and flexible.
3. Minimizing Memory Overhead
Since Redis is often used in scenarios where speed and memory efficiency are critical, minimizing memory overhead for string storage is essential. Redis uses an optimized memory representation for strings. Short strings (typically 39 bytes or less) are stored in a highly efficient structure known as SDS (Simple Dynamic Strings), which reduces the overhead associated with managing string data.
Redis takes several measures to optimize memory usage:
- SDS Structure: SDS supports dynamic resizing, allowing it to efficiently manage memory when strings are appended or modified without needing to frequently reallocate memory.
- Lazy Reallocation: Redis avoids unnecessary memory allocation by resizing strings in chunks, reducing the cost of frequent modifications.
- Compact Storage for Small Strings: Small strings are packed together, reducing fragmentation and overall memory consumption.
Why doesn’t Redis Use char*
?
If you’ve developed programs in C, you’re probably familiar with using char*
arrays to implement strings. The C standard library (string.h
) provides various string operations like strcmp
for comparison, strlen
for length, and strcat
for appending, which simplifies these tasks for developers. However, why doesn’t Redis use char*
? To answer this question, we first need to understand the structure of char*
string arrays. Let’s delve into that analysis.
The Design of char*
Struct
The char*
array is a simple structure consisting of a contiguous block of memory where each character of the string is stored sequentially. For example, the char*
array structure for the string “johnson” is shown below.
The last character in this array is '\0'
. This character is crucial because it marks the end of the string. In C, char*
pointers refer to the start of the array, and functions in the standard library use '\0'
to determine where the string ends. For instance, the strlen
function calculates the length of a string by counting characters until it encounters '\0'
. The execution flow of strlen
is illustrated below.
Let’s look at how the '\0'
termination character affects string length with a code example. I’ve created two string variables, a
and b
, with values "john\0son"
and "johnson\0"
, respectively. Using the strlen
function to compute their lengths, the results are 4 and 7.
1 |
|
This is because the '\0'
in a
comes after “john”, so strlen
counts only the first 4 characters. For b
, the '\0'
is after “johnson”, so strlen
counts all 7 characters. This shows that char*
strings, which use '\0'
to indicate the end, can be problematic if the data itself contains '\0'
, as it leads to truncation. This limitation makes it unsuitable for storing arbitrary binary data, which is what Redis needs to handle.
The Complexity of String Operations
Aside from design issues related to char*
character arrays, using '\0'
as the string terminator allows functions to determine where the string ends. However, it also has a downside: it increases the complexity of string operations. Take the strlen
function as an example. This function needs to traverse each character in the array to determine the string’s length, resulting in a time complexity of O(N).
Another commonly used operation is the string concatenation function, strcat
. The strcat
function appends a source string (src
) to the end of a destination string. Here’s how the strcat
function works:
1 | /** |
From the code, we can see that both strcat
and strlen
functions have high complexity. They both require traversing the strings to find their end. For strcat
, there’s an additional traversal needed to append the source string to the destination. Furthermore, strcat
must ensure that there is enough space in the destination string to accommodate the new content; otherwise, the append operation will fail.
This means developers need to manage the space for the destination string carefully when using strcat
, often requiring dynamic allocation which adds to the programming complexity. Increased complexity in these operations affects the efficiency of string manipulation, which does not align with Redis’s need for efficient string handling.
Considering the shortcomings of using char*
for string implementation in C, we now need to explore alternative ways to handle strings. Therefore, next, let’s delve into how Redis approaches the design considerations for string implementation.
Design Philosophy of SDS
Since Redis is developed in C, it leverages character arrays to store data for compatibility with C standard library functions. However, unlike C’s basic character arrays, Redis introduces a specially designed data structure called SDS (Simple Dynamic Strings). Let’s take a closer look at it.
SDS Structure Design
The SDS structure includes a character array buf[]
for storing actual data. Additionally, it contains three pieces of metadata:
len
(the length of the data in the array),alloc
(the allocated space for the array),- and
flags
(the type of SDS).
Redis defines various data types for len
and alloc
to represent different SDS types, which I will explain in detail later. The diagram below illustrates the SDS structure for reference.
If you’ve looked into the Redis source code, you may have noticed that Redis uses typedef
to create an alias sds
for char*
.
1 | typedef char *sds; |
This is because, fundamentally, SDS is a character array with added metadata. In Redis, sds
is used whenever character arrays are needed.
When creating a new string, Redis calls the SDS creation function sdsnewlen
. This function allocates a new sds
variable (essentially a char*
), initializes an SDS structure, and assigns the buf[]
array in the SDS structure to the sds
variable. Finally, sdsnewlen
copies the desired string into the sds
variable. The code below illustrates the logic of the sdsnewlen
function.
1 | /* Create a new sds string with the content specified by the 'init' pointer |
With an understanding of the SDS structure, let’s now explore how SDS improves operation efficiency compared to traditional C strings.
SDS Operation Efficiency
The SDS structure tracks both the used and allocated space for the character array, which provides greater efficiency compared to traditional C strings. For example, consider the string append operation. In Redis, the function for appending strings is sdscatlen
in the sds.c
file. This function takes three parameters: the target string s
, the source string t
, and the length len
to append. Here’s the relevant source code:
1 | /* Append the specified binary-safe string pointed by 't' of 'len' bytes to the |
By analyzing the source code of this function, we can see that the implementation of sdscatlen
is quite straightforward and consists of three main steps:
- First, it retrieves the current length of the destination string and calls
sdsMakeRoomFor
to check if additional space is needed based on the current length and the length of the data to append. This step ensures there is enough space to accommodate the new data. - Once sufficient space is confirmed, the function appends the specified length of data from the source string to the destination string.
- Finally, it updates the length of the destination string.
I’ve included a diagram illustrating the execution process of sdscatlen
for clarity. Compared to string operations in C, SDS avoids the need for traversing the string by keeping track of the used length and allocated space, which reduces operational overhead. This design approach makes operations like creation, appending, copying, and comparison more efficient and is worth emulating.
Additionally, SDS encapsulates space checking and resizing for the target string in the sdsMakeRoomFor
function. This function is called directly for operations that involve changing the string’s space, such as appending or copying.
This design prevents developers from forgetting to resize the target string, which could otherwise lead to failed operations. For instance, when using the strcpy(char *dest, const char *src)
function, if src
is longer than dest
and we don’t check for this, it could result in a buffer overflow. Thus, this encapsulation approach is a valuable design principle to adopt.
Now, besides using metadata to track string length and the encapsulation approach, what other excellent design and implementation aspects does SDS offer? This is related to the memory efficiency needs I mentioned earlier about Redis.
Let’s delve into how SDS employs programming techniques to optimize memory usage.
Programming Techniques for Compact String Structures
Earlier, I mentioned that the SDS structure includes a metadata field called flags
, which indicates the SDS type. In fact, SDS defines five types: sdshdr5
, sdshdr8
, sdshdr16
, sdshdr32
, and sdshdr64
. The key difference among these types lies in the data types used for the len
(current length) and alloc
(allocated space) fields in their data structures.
Since Redis no longer uses the sdshdr5
type, we’ll focus on the remaining four types. For example, here’s the definition of sdshdr8
:
1 | struct __attribute__ ((__packed__)) sdshdr8 { |
We can see that for sdshdr8
, both the current length (len
) and allocated space (alloc
) use uint8_t
, an 8-bit unsigned integer that occupies 1 byte. This means that with sdshdr8
, the maximum length of the character array, including the null terminator, is 256 bytes (2^8).
For sdshdr16
, sdshdr32
, and sdshdr64
, the len
and alloc
fields are uint16_t
, uint32_t
, and uint64_t
respectively, allowing for character arrays up to 65,536 bytes, 4,294,967,296 bytes, and 18,446,744,073,709,551,616 bytes. Thus, the metadata for these types consumes 2, 4, and 8 bytes of memory respectively.
The reason for these different structures is to efficiently handle strings of varying sizes and save memory. If all structures used uint64_t
for len
and alloc
, even small strings would waste memory. For example, with a 10-byte string, a uint64_t
header would consume 16 bytes, which is inefficient.
Additionally, Redis uses compiler optimizations to further reduce memory usage. For the sdshdr8
structure, it uses the __attribute__((__packed__))
directive to prevent the compiler from aligning the structure to 8-byte boundaries. This packed attribute ensures that memory is allocated more compactly, avoiding the default 8-byte alignment that would waste space even if a variable is smaller than 8 bytes.
1 | struct __attribute__ ((__packed__)) sdshdr8 |
To help you understand, let me give you an example. Suppose I define a structure s1
with two member variables: one of type char
and the other of type int
, like this:
1 |
|
Although char
takes up 1 byte and int
takes up 4 bytes, if you run this code, you’ll notice that the total size printed is 8 bytes. This is because, by default, the compiler allocates 8 bytes to the s1
structure, leaving 3 bytes unused due to padding.
To optimize memory usage, Redis takes a very efficient approach. It uses the __attribute__((__packed__))
attribute to define structures, ensuring the compiler only allocates exactly the amount of memory needed for the structure’s fields.
For example, if I define another structure s2
with the same char
and int
fields, but use the __attribute__((__packed__))
attribute, the code would look like this:
1 |
|
When you run this code, the printed result is 5 bytes, indicating that the compiler has packed the structure more tightly, using only 5 bytes of memory.
In summary, if you’re developing a program and want to minimize the memory footprint of your data structures, you can use the __attribute__((__packed__))
technique to achieve this.
Conclusion
In this lesson, I introduced you to the design and implementation of strings in Redis. It’s important to understand that string implementations need to prioritize efficiency, support storing any type of binary data, and minimize memory usage. Redis’ approach to string design is definitely something worth studying and applying.
There are three key points you should focus on:
- The limitations of using
char*
to implement strings in C. Sincechar*
relies on the\0
character to indicate the end of a string, operations often require scanning the entire string, which isn’t efficient. Moreover, it can’t fully represent data that contains\0
, which doesn’t meet Redis’ needs. - The design principles and implementation of strings in Redis. Redis introduces a specialized data structure called SDS (Simple Dynamic String), which builds on character arrays by adding metadata like the length of the string and the size of the allocated space. This allows operations like appending, copying, and comparing strings to be done by directly reading the metadata, improving efficiency. SDS also doesn’t use the
\0
character to mark the end of the string, treating it as binary data, which allows for storing things like images. - SDS uses different types to represent strings of varying sizes and employs the
__attribute__((__packed__))
technique to achieve a compact memory layout, helping to save memory.
While strings may seem simple, today’s lesson shows that implementing them involves thoughtful design. The differences and similarities between C’s char*
strings and Redis’ SDS are often asked about in Redis interviews, so I hope you’ll master these distinctions through this lesson.