Multibyte Characters in C
Overview of Unicode
When the C language was first developed, it focused on English text and used 7-bit ASCII to represent characters. ASCII code values range from 0 to 127, so it can represent exactly 128 characters, each of which fits in a single byte.
Most other writing systems need more than one byte per character. Chinese alone includes tens of thousands of characters, which requires a character set that uses multiple bytes.
Initially, different countries had their own character encoding systems, making it difficult to mix various character sets. To address this, Unicode was developed, consolidating all characters into a single character set.
Unicode assigns a unique number to each character, known as a code point. The range from 0 to 127 coincides with ASCII. Code points are conventionally written as “U+” followed by the hexadecimal value; for example, U+0041 represents the letter A.
Unicode code points range from U+0000 to U+10FFFF, a space of just over 1.1 million values, of which well over 100,000 are currently assigned to characters. Representing any code point directly takes 21 bits, so at least three bytes are needed. However, not all documents require that many characters. For English text that uses only ASCII, spending three bytes on each character would make files three times larger than necessary.
To accommodate different usage needs, the Unicode Consortium has provided three encoding methods for representing Unicode code points:
- UTF-8: Uses 1 to 4 bytes to represent a code point, with the number of bytes varying by character.
- UTF-16: Represents characters in the Basic Multilingual Plane (U+0000 to U+FFFF) with 2 bytes, while other characters use 4 bytes.
- UTF-32: Uses a uniform 4 bytes for each code point.
UTF-8 is the most widely used encoding, as it represents ASCII characters (U+0000 to U+007F) with just one byte, making it compatible with traditional ASCII encoding.
C provides two macros that indicate the maximum number of bytes in a multibyte character:
- MB_LEN_MAX: The maximum byte length of a multibyte character in any supported locale. It is a compile-time constant defined in the limits.h header.
- MB_CUR_MAX: The maximum byte length for the current locale, which is always less than or equal to MB_LEN_MAX. It is defined in the stdlib.h header, and its value may change when setlocale() is called.
Character Representation
At its core, character representation maps each character to an integer, which serves as its index in an encoding table.
C offers several escape sequences for writing a character by its integer value:
- \123: Represents a character using an octal value (one to three octal digits following the backslash).
- \x4D: Represents a character using a hexadecimal value (\x followed by one or more hexadecimal digits).
- \u2620: Represents a character using a Unicode code point (not applicable to most ASCII characters), with exactly four hexadecimal digits following \u.
- \U0001243F: Represents a character using a Unicode code point (not applicable to most ASCII characters), with exactly eight hexadecimal digits following \U.
Examples:
printf("ABC\n");
printf("\101\102\103\n");
printf("\x41\x42\x43\n");
All three lines output “ABC”.
printf("\u2022 Bullet 1\n");
printf("\U00002022 Bullet 1\n");
Both of these lines output “• Bullet 1”.
Representation of Multi-Byte Characters in C
The C standard only guarantees that basic characters can be written directly in string literals. Other characters should be written using their code points, and the current system must support an encoding that can represent those code points.
Basic characters are all printable ASCII characters, with three exceptions: @, $, and `.
Thus, for non-English characters, the portable choice is the Unicode code point format.
char* s = "\u6625\u5929";
The string above holds the Chinese word “春天” (spring).
If the current system uses UTF-8 encoding (for both the source file and the compiler's execution character set), you can write multi-byte characters directly in string literals:
char* s = "春天";
Note that the \u and \U syntax cannot be used for ASCII characters (more precisely, for code points less than 0xA0), except for the following three characters: 0x24 ($), 0x40 (@), and 0x60 (`).
char* s = "\u0024\u0040\u0060";
The string above correctly represents the three characters “$@`”, but other ASCII characters cannot be written this way.
To ensure that multi-byte characters are interpreted correctly at run time, switch the program from the default "C" locale to the environment's locale:
setlocale(LC_ALL, "");
With an empty string as its second argument, setlocale() selects the locale configured in the user's environment. Its prototype is defined in the header file locale.h, as detailed in the standard library section on locale.h.
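You can also specify the locale and its encoding explicitly:
setlocale(LC_ALL, "zh_CN.UTF-8");
This switches the program to a Chinese locale with UTF-8 encoding. Locale names are platform-specific, and the call fails, returning NULL, if the named locale is not available.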
C11 and later also provide the u8 prefix, which requests UTF-8 encoding for a string literal regardless of the locale:
char* s = u8"春天";
When a string contains multi-byte characters, the number of bytes does not equal the number of characters. For example, a string may be 10 bytes long but contain only 7 or 5 characters.
setlocale(LC_ALL, "");
For example, if s points to the UTF-8 string “春天”, it contains only two characters, but strlen(s) returns 6, because these two characters occupy 6 bytes.
C’s narrow string and character functions operate on individual bytes, so functions like strtok(), strchr(), strspn(), toupper(), tolower(), and isalpha() will not yield correct results with multi-byte characters.
Wide Characters
In the previous section, we discussed multi-byte strings, where the byte width of each character can vary. While this encoding method is convenient, it complicates string processing since each character must be examined individually to determine its byte length. To address this, C provides a fixed-width character storage method known as wide characters.
Wide characters use a consistent number of bytes per character, either 2 bytes or 4 bytes. This uniformity simplifies and speeds up processing.
Wide characters are represented by a dedicated data type, wchar_t, which can be either signed or unsigned, depending on the implementation. This type is typically 16 bits (2 bytes) or 32 bits (4 bytes) wide, which is intended to be enough to store every character used by the current system. The wchar_t type is defined in the header file wchar.h (and also in stddef.h).
To define wide character literals, you must prefix them with an “L”. Without this prefix, C will treat the literal as a narrow character type.
Here’s an example:
setlocale(LC_ALL, "");
The “L” prefix before a single quote denotes a wide character, which corresponds to the %lc format specifier in printf(). Similarly, the “L” prefix before a double quote denotes a wide string, which corresponds to the %ls specifier.
Wide strings also end with a wide null character, which occupies multiple bytes.
When working with wide characters, you should use functions specifically designed for them, most of which are defined in the wchar.h header file.
Multibyte Character Handling Functions
mblen()
The mblen() function returns the number of bytes occupied by a multibyte character. Its prototype is defined in the stdlib.h header file:
int mblen(const char* mbstr, size_t n);
This function takes two parameters. The first is a pointer to a multibyte string; mblen() examines the character that starts at that position, typically the first character of the string. The second is the maximum number of bytes to examine, which normally need not exceed the maximum number of bytes used by a single character in the current locale, MB_CUR_MAX.
The return value is the number of bytes used by that character. If the character is the null character, it returns 0; if the bytes do not form a valid multibyte character, it returns -1.
Example usage:
setlocale(LC_ALL, "");
For example, in a UTF-8 locale mblen() returns 3 for the string “春天”, because its first character “春” occupies 3 bytes, and 1 for the string “abc”, whose first character “a” occupies 1 byte.
wctomb()
The wctomb() function (wide character to multibyte) converts a wide character to a multibyte character. Its prototype is also defined in the stdlib.h header file:
int wctomb(char* s, wchar_t wc);
wctomb() takes two parameters: the first is a buffer for the resulting multibyte character, which should have room for at least MB_CUR_MAX bytes, and the second is the wide character to convert. The return value indicates the number of bytes used to store the multibyte character; if the conversion fails, it returns -1. The function does not append a null terminator to the buffer.
Example usage:
setlocale(LC_ALL, "");
For example, in a UTF-8 locale, converting the wide character “牛” with wctomb() produces a 3-byte multibyte character, so the return value is 3.
mbtowc()
The mbtowc() function converts a multibyte character to a wide character. Its prototype is defined in the header file stdlib.h.
int mbtowc(wchar_t* wchar, const char* mbchar, size_t count);
Parameters:
- wchar_t* wchar: A pointer to the wide character that will store the result.
- const char* mbchar: A pointer to the multibyte character to be converted.
- size_t count: The maximum number of bytes to examine in the multibyte character.
Return Value:
Returns the number of bytes used by the multibyte character if successful; returns 0 if the multibyte character is the null character; returns -1 if the conversion fails.
Example:
setlocale(LC_ALL, "");
For example, in a UTF-8 locale, mbtowc() converts the multibyte character “牛” stored in mbchar to the wide character wc and returns 3, the number of bytes used by mbchar.
wcstombs()
The wcstombs() function converts a wide string to a multibyte string. Its prototype is also defined in stdlib.h.
size_t wcstombs(char* mbstr, const wchar_t* wcstr, size_t count);
Parameters:
- char* mbstr: A pointer to the destination multibyte string.
- const wchar_t* wcstr: A pointer to the wide string to be converted.
- size_t count: The maximum number of bytes to write to the multibyte string.
Return Value:
Returns the number of bytes written to the multibyte string, not including the terminating null byte; returns (size_t)-1 if a wide character cannot be converted. If the return value equals count, the result is not null-terminated.
Example:
setlocale(LC_ALL, "");
In this case, wcstombs() converts the wide string wcs to the multibyte string mbs and returns 6, the number of bytes written, excluding the null terminator.
If the first argument of wcstombs() is NULL, the function returns the number of bytes required for the conversion. This behavior is a widely supported POSIX extension rather than a guarantee of ISO C.
mbstowcs()
The mbstowcs() function converts a multibyte string to a wide string. Its prototype is defined in stdlib.h.
size_t mbstowcs(wchar_t* wcstr, const char* mbstr, size_t count);
Parameters:
- wchar_t* wcstr: A pointer to the destination wide string.
- const char* mbstr: A pointer to the multibyte string to be converted.
- size_t count: The maximum number of wide characters to write to the destination.
Return Value:
Returns the number of wide characters written, not including the terminating null wide character; returns (size_t)-1 if the conversion fails. If the return value equals the third parameter, the resulting wide string is not null-terminated.
Example:
setlocale(LC_ALL, "");
In this example, the multibyte string mbs is converted to a wide string wcs, successfully converting 4 characters, which is reflected in the return value.
If the first argument of mbstowcs() is NULL, the function returns the number of wide characters that would be generated. This behavior is a widely supported POSIX extension rather than a guarantee of ISO C.