Redis’s Requirements for Strings

Strings are an essential data type in app development, frequently used to store a variety of information such as user profiles, product descriptions, and system messages. In Redis, a popular in-memory data store, both the keys and often the values in key-value pairs are represented as strings. This makes strings one of the foundational data types in Redis, contributing to its flexibility and simplicity when handling data.

For instance, when saving user data such as name, gender, and city, you can use a simple Redis command like:

1
SET user:id:100 {"name": "John", "gender": "M", "city": "New York"}

This example stores a stringified JSON object as the value, with a key that identifies the user.

Strings are also crucial for communication in Redis. The commands exchanged between Redis instances and clients, as well as much of the internal data, are transmitted as strings. Given the widespread use of strings in Redis, the way strings are implemented plays a critical role in system performance and efficiency. To optimize this, the implementation should satisfy the following three key requirements:

1. Support for a Wide Range of Efficient String Operations

String operations like appending, copying, comparing, and retrieving the length of a string are extremely common in app development. In Redis, strings are implemented in such a way that these operations are highly efficient. For example, appending data to a string can be done using the APPEND command:

1
APPEND user:name "Doe"

This adds “Doe” to the existing string value stored at the user:name key. Similar commands exist for other operations like comparing (STRLEN) or copying strings. Efficient handling of these operations ensures that Redis remains performant, even as the volume of data grows.

2. Ability to Handle Arbitrary Binary Data

Strings in Redis are not limited to human-readable text. They can also store binary data such as images, audio files, or other types of raw data. This flexibility is crucial in modern app development, where applications frequently need to handle multimedia or complex data formats. Redis strings can store up to 512MB of binary data, allowing developers to use them for a variety of tasks beyond simple text storage.

For example, storing an image file in Redis might involve converting the image to binary and saving it as a string:

1
SET image:profile:100 <binary_data>

Redis doesn’t distinguish between text and binary data at the storage level, so it treats everything as a string, making the system both simple and flexible.

3. Minimizing Memory Overhead

Since Redis is often used in scenarios where speed and memory efficiency are critical, minimizing memory overhead for string storage is essential. Redis uses an optimized memory representation for strings. Short strings (typically 39 bytes or less) are stored in a highly efficient structure known as SDS (Simple Dynamic Strings), which reduces the overhead associated with managing string data.

Redis takes several measures to optimize memory usage:

  • SDS Structure: SDS supports dynamic resizing, allowing it to efficiently manage memory when strings are appended or modified without needing to frequently reallocate memory.
  • Lazy Reallocation: Redis avoids unnecessary memory allocation by resizing strings in chunks, reducing the cost of frequent modifications.
  • Compact Storage for Small Strings: Small strings are packed together, reducing fragmentation and overall memory consumption.

Why doesn’t Redis Use char*?

If you’ve developed programs in C, you’re probably familiar with using char* arrays to implement strings. The C standard library (string.h) provides various string operations like strcmp for comparison, strlen for length, and strcat for appending, which simplifies these tasks for developers. However, why doesn’t Redis use char*? To answer this question, we first need to understand the structure of char* string arrays. Let’s delve into that analysis.

The Design of char* Struct

The char* array is a simple structure consisting of a contiguous block of memory where each character of the string is stored sequentially. For example, the char* array structure for the string “johnson” is shown below.

The last character in this array is '\0'. This character is crucial because it marks the end of the string. In C, char* pointers refer to the start of the array, and functions in the standard library use '\0' to determine where the string ends. For instance, the strlen function calculates the length of a string by counting characters until it encounters '\0'. The execution flow of strlen is illustrated below.

Let’s look at how the '\0' termination character affects string length with a code example. I’ve created two string variables, a and b, with values "john\0son" and "johnson\0", respectively. Using the strlen function to compute their lengths, the results are 4 and 7.

1
2
3
4
5
6
7
8
9
10
11
#include <stdio.h>
#include <string.h>

int main()
{
char *a = "john\0son";
char *b = "johnson\0";
printf("%lu\n", strlen(a));
printf("%lu\n", strlen(b));
return 0;
}

This is because the '\0' in a comes after “john”, so strlen counts only the first 4 characters. For b, the '\0' is after “johnson”, so strlen counts all 7 characters. This shows that char* strings, which use '\0' to indicate the end, can be problematic if the data itself contains '\0', as it leads to truncation. This limitation makes it unsuitable for storing arbitrary binary data, which is what Redis needs to handle.

The Complexity of String Operations

Aside from design issues related to char* character arrays, using '\0' as the string terminator allows functions to determine where the string ends. However, it also has a downside: it increases the complexity of string operations. Take the strlen function as an example. This function needs to traverse each character in the array to determine the string’s length, resulting in a time complexity of O(N).

Another commonly used operation is the string concatenation function, strcat. The strcat function appends a source string (src) to the end of a destination string. Here’s how the strcat function works:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
/**
* strcat - Concatenates two strings
* @dest: Destination string to which the source string will be appended
* @src: Source string that will be appended to the destination string
*
* This function concatenates the source string `src` to the end of the destination string `dest`.
* It first finds the end of the destination string `dest` (the first null character),
* then copies each character from the source string `src` into the destination string `dest`
* until the null character of the source string `src` is reached. Finally, it returns the original pointer to the destination string.
*
* Returns: Pointer to the destination string after concatenation
*/
char *strcat(char *dest, const char *src)
{
// Temporarily store the pointer to the destination string
char *tmp = dest;

// Find the end of the destination string
while (*dest)
dest++;

// Copy each character from the source string to the destination string until the null character
while ((*dest++ = *src++) != '\0')
;

// Return the original pointer to the destination string
return tmp;
}

From the code, we can see that both strcat and strlen functions have high complexity. They both require traversing the strings to find their end. For strcat, there’s an additional traversal needed to append the source string to the destination. Furthermore, strcat must ensure that there is enough space in the destination string to accommodate the new content; otherwise, the append operation will fail.

This means developers need to manage the space for the destination string carefully when using strcat, often requiring dynamic allocation which adds to the programming complexity. Increased complexity in these operations affects the efficiency of string manipulation, which does not align with Redis’s need for efficient string handling.

Considering the shortcomings of using char* for string implementation in C, we now need to explore alternative ways to handle strings. Therefore, next, let’s delve into how Redis approaches the design considerations for string implementation.

Design Philosophy of SDS

Since Redis is developed in C, it leverages character arrays to store data for compatibility with C standard library functions. However, unlike C’s basic character arrays, Redis introduces a specially designed data structure called SDS (Simple Dynamic Strings). Let’s take a closer look at it.

SDS Structure Design

The SDS structure includes a character array buf[] for storing actual data. Additionally, it contains three pieces of metadata:

  • len (the length of the data in the array),
  • alloc (the allocated space for the array),
  • and flags (the type of SDS).

Redis defines various data types for len and alloc to represent different SDS types, which I will explain in detail later. The diagram below illustrates the SDS structure for reference.

If you’ve looked into the Redis source code, you may have noticed that Redis uses typedef to create an alias sds for char*.

1
typedef char *sds;

This is because, fundamentally, SDS is a character array with added metadata. In Redis, sds is used whenever character arrays are needed.

When creating a new string, Redis calls the SDS creation function sdsnewlen. This function allocates a new sds variable (essentially a char*), initializes an SDS structure, and assigns the buf[] array in the SDS structure to the sds variable. Finally, sdsnewlen copies the desired string into the sds variable. The code below illustrates the logic of the sdsnewlen function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
/* Create a new sds string with the content specified by the 'init' pointer
* and 'initlen'.
* If NULL is used for 'init' the string is initialized with zero bytes.
* If SDS_NOINIT is used, the buffer is left uninitialized;
*
* The string is always null-termined (all the sds strings are, always) so
* even if you create an sds string with:
*
* mystring = sdsnewlen("abc",3);
*
* You can print the string with printf() as there is an implicit \0 at the
* end of the string. However the string is binary safe and can contain
* \0 characters in the middle, as the length is stored in the sds header. */
sds _sdsnewlen(const void *init, size_t initlen, int trymalloc) {
void *sh;
sds s;
char type = sdsReqType(initlen);
/* Empty strings are usually created in order to append. Use type 8
* since type 5 is not good at this. */
if (type == SDS_TYPE_5 && initlen == 0) type = SDS_TYPE_8;
int hdrlen = sdsHdrSize(type);
unsigned char *fp; /* flags pointer. */
size_t usable;

assert(initlen + hdrlen + 1 > initlen); /* Catch size_t overflow */
sh = trymalloc?
s_trymalloc_usable(hdrlen+initlen+1, &usable) :
s_malloc_usable(hdrlen+initlen+1, &usable);
if (sh == NULL) return NULL;
if (init==SDS_NOINIT)
init = NULL;
else if (!init)
memset(sh, 0, hdrlen+initlen+1);
s = (char*)sh+hdrlen;
fp = ((unsigned char*)s)-1;
usable = usable-hdrlen-1;
if (usable > sdsTypeMaxSize(type))
usable = sdsTypeMaxSize(type);
switch(type) {
case SDS_TYPE_5: {
*fp = type | (initlen << SDS_TYPE_BITS);
break;
}
case SDS_TYPE_8: {
SDS_HDR_VAR(8,s);
sh->len = initlen;
sh->alloc = usable;
*fp = type;
break;
}
case SDS_TYPE_16: {
SDS_HDR_VAR(16,s);
sh->len = initlen;
sh->alloc = usable;
*fp = type;
break;
}
case SDS_TYPE_32: {
SDS_HDR_VAR(32,s);
sh->len = initlen;
sh->alloc = usable;
*fp = type;
break;
}
case SDS_TYPE_64: {
SDS_HDR_VAR(64,s);
sh->len = initlen;
sh->alloc = usable;
*fp = type;
break;
}
}
if (initlen && init)
memcpy(s, init, initlen);
s[initlen] = '\0';
return s;
}

With an understanding of the SDS structure, let’s now explore how SDS improves operation efficiency compared to traditional C strings.

SDS Operation Efficiency

The SDS structure tracks both the used and allocated space for the character array, which provides greater efficiency compared to traditional C strings. For example, consider the string append operation. In Redis, the function for appending strings is sdscatlen in the sds.c file. This function takes three parameters: the target string s, the source string t, and the length len to append. Here’s the relevant source code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
/* Append the specified binary-safe string pointed by 't' of 'len' bytes to the
* end of the specified sds string 's'.
*
* After the call, the passed sds string is no longer valid and all the
* references must be substituted with the new pointer returned by the call. */
sds sdscatlen(sds s, const void *t, size_t len) {
size_t curlen = sdslen(s);

s = sdsMakeRoomFor(s,len);
if (s == NULL) return NULL;
memcpy(s+curlen, t, len);
sdssetlen(s, curlen+len);
s[curlen+len] = '\0';
return s;
}

By analyzing the source code of this function, we can see that the implementation of sdscatlen is quite straightforward and consists of three main steps:

  1. First, it retrieves the current length of the destination string and calls sdsMakeRoomFor to check if additional space is needed based on the current length and the length of the data to append. This step ensures there is enough space to accommodate the new data.
  2. Once sufficient space is confirmed, the function appends the specified length of data from the source string to the destination string.
  3. Finally, it updates the length of the destination string.

I’ve included a diagram illustrating the execution process of sdscatlen for clarity. Compared to string operations in C, SDS avoids the need for traversing the string by keeping track of the used length and allocated space, which reduces operational overhead. This design approach makes operations like creation, appending, copying, and comparison more efficient and is worth emulating.

Additionally, SDS encapsulates space checking and resizing for the target string in the sdsMakeRoomFor function. This function is called directly for operations that involve changing the string’s space, such as appending or copying.

This design prevents developers from forgetting to resize the target string, which could otherwise lead to failed operations. For instance, when using the strcpy(char *dest, const char *src) function, if src is longer than dest and we don’t check for this, it could result in a buffer overflow. Thus, this encapsulation approach is a valuable design principle to adopt.

Now, besides using metadata to track string length and the encapsulation approach, what other excellent design and implementation aspects does SDS offer? This is related to the memory efficiency needs I mentioned earlier about Redis.

Let’s delve into how SDS employs programming techniques to optimize memory usage.

Programming Techniques for Compact String Structures

Earlier, I mentioned that the SDS structure includes a metadata field called flags, which indicates the SDS type. In fact, SDS defines five types: sdshdr5, sdshdr8, sdshdr16, sdshdr32, and sdshdr64. The key difference among these types lies in the data types used for the len (current length) and alloc (allocated space) fields in their data structures.

Since Redis no longer uses the sdshdr5 type, we’ll focus on the remaining four types. For example, here’s the definition of sdshdr8:

1
2
3
4
5
6
struct __attribute__ ((__packed__)) sdshdr8 {
uint8_t len; /* used */
uint8_t alloc; /* excluding the header and null terminator */
unsigned char flags; /* 3 lsb of type, 5 unused bits */
char buf[];
};

We can see that for sdshdr8, both the current length (len) and allocated space (alloc) use uint8_t, an 8-bit unsigned integer that occupies 1 byte. This means that with sdshdr8, the maximum length of the character array, including the null terminator, is 256 bytes (2^8).

For sdshdr16, sdshdr32, and sdshdr64, the len and alloc fields are uint16_t, uint32_t, and uint64_t respectively, allowing for character arrays up to 65,536 bytes, 4,294,967,296 bytes, and 18,446,744,073,709,551,616 bytes. Thus, the metadata for these types consumes 2, 4, and 8 bytes of memory respectively.

The reason for these different structures is to efficiently handle strings of varying sizes and save memory. If all structures used uint64_t for len and alloc, even small strings would waste memory. For example, with a 10-byte string, a uint64_t header would consume 16 bytes, which is inefficient.

Additionally, Redis uses compiler optimizations to further reduce memory usage. For the sdshdr8 structure, it uses the __attribute__((__packed__)) directive to prevent the compiler from aligning the structure to 8-byte boundaries. This packed attribute ensures that memory is allocated more compactly, avoiding the default 8-byte alignment that would waste space even if a variable is smaller than 8 bytes.

1
struct __attribute__ ((__packed__)) sdshdr8

To help you understand, let me give you an example. Suppose I define a structure s1 with two member variables: one of type char and the other of type int, like this:

1
2
3
4
5
6
7
8
9
10
11
#include <stdio.h>

int main() {
struct s1 {
char a;
int b;
} ts1;

printf("%lu\n", sizeof(ts1));
return 0;
}

Although char takes up 1 byte and int takes up 4 bytes, if you run this code, you’ll notice that the total size printed is 8 bytes. This is because, by default, the compiler allocates 8 bytes to the s1 structure, leaving 3 bytes unused due to padding.

To optimize memory usage, Redis takes a very efficient approach. It uses the __attribute__((__packed__)) attribute to define structures, ensuring the compiler only allocates exactly the amount of memory needed for the structure’s fields.

For example, if I define another structure s2 with the same char and int fields, but use the __attribute__((__packed__)) attribute, the code would look like this:

1
2
3
4
5
6
7
8
9
10
11
#include <stdio.h>

int main() {
struct __attribute__((packed)) s2 {
char a;
int b;
} ts2;

printf("%lu\n", sizeof(ts2));
return 0;
}

When you run this code, the printed result is 5 bytes, indicating that the compiler has packed the structure more tightly, using only 5 bytes of memory.

In summary, if you’re developing a program and want to minimize the memory footprint of your data structures, you can use the __attribute__((__packed__)) technique to achieve this.

Conclusion

In this lesson, I introduced you to the design and implementation of strings in Redis. It’s important to understand that string implementations need to prioritize efficiency, support storing any type of binary data, and minimize memory usage. Redis’ approach to string design is definitely something worth studying and applying.

There are three key points you should focus on:

  1. The limitations of using char* to implement strings in C. Since char* relies on the \0 character to indicate the end of a string, operations often require scanning the entire string, which isn’t efficient. Moreover, it can’t fully represent data that contains \0, which doesn’t meet Redis’ needs.
  2. The design principles and implementation of strings in Redis. Redis introduces a specialized data structure called SDS (Simple Dynamic String), which builds on character arrays by adding metadata like the length of the string and the size of the allocated space. This allows operations like appending, copying, and comparing strings to be done by directly reading the metadata, improving efficiency. SDS also doesn’t use the \0 character to mark the end of the string, treating it as binary data, which allows for storing things like images.
  3. SDS uses different types to represent strings of varying sizes and employs the __attribute__((__packed__)) technique to achieve a compact memory layout, helping to save memory.

While strings may seem simple, today’s lesson shows that implementing them involves thoughtful design. The differences and similarities between C’s char* strings and Redis’ SDS are often asked about in Redis interviews, so I hope you’ll master these distinctions through this lesson.