Understanding Text Encoding and Character Representation in HTML
Introduction
A webpage contains a large amount of text, and the browser needs to know the encoding method of this text in order to display it correctly.
Typically, when a server sends an HTML file to the browser, it specifies the webpage’s encoding method through the HTTP header:
Content-Type: text/html; charset=UTF-8
In the example above, the Content-Type field in the HTTP header first indicates that the data being sent by the server is of type text/html (i.e., an HTML page), and then specifies that the text encoding is UTF-8.
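As a rough illustration of how a client might pull the charset out of such a header value, the Python sketch below leans on the standard library's email.message.Message class, which understands MIME-style parameters. The helper name charset_from_content_type is hypothetical and exists only for this example.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases whatever charset it finds.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

When the header carries no charset parameter, the helper returns None, and the page has to declare its encoding some other way.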
The webpage itself can also declare its encoding internally using the <meta> tag:
<meta charset="UTF-8">
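The following Python sketch shows one way a tool could look for that declaration, using the standard library's html.parser.HTMLParser. It assumes the markup has already been decoded into a string, which is a simplification (a browser has to examine the raw bytes before it knows the encoding). The names MetaCharsetFinder and find_meta_charset are made up for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetFinder(HTMLParser):
    """Record the value of the first <meta charset="..."> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "meta" and self.charset is None:
            for name, value in attrs:
                if name == "charset" and value:
                    self.charset = value


def find_meta_charset(markup: str) -> Optional[str]:
    finder = MetaCharsetFinder()
    finder.feed(markup)
    finder.close()
    return finder.charset


page = '<html><head><meta charset="UTF-8"></head><body></body></html>'
print(find_meta_charset(page))  # UTF-8
```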
Numeric Representation of Characters
Webpages can use different encoding methods, but UTF-8 is the most commonly used. UTF-8 is a way of encoding the Unicode character set, which is designed to include all characters in the world, currently encompassing over 100,000 characters.
Each character has a Unicode number, known as a code point. If you know the code point, you can determine the corresponding character. For example, the code point for the letter “a” is 97 in decimal (61 in hexadecimal), and the code point for the Chinese character “中” is 20013 in decimal (4e2d in hexadecimal).
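The code points quoted above can be checked with a few lines of Python, where ord() returns a character's code point and chr() goes the other way. The helper describe is hypothetical; it also reports the UTF-8 bytes for each character, to show that the code point and its encoded form are two different things.

```python
def describe(ch: str):
    """Return (decimal code point, hex code point, UTF-8 bytes as hex)."""
    code_point = ord(ch)
    return code_point, format(code_point, "x"), ch.encode("utf-8").hex()


print(describe("a"))   # (97, '61', '61')
print(describe("中"))  # (20013, '4e2d', 'e4b8ad')

# chr() maps a code point back to its character.
print(chr(97), chr(0x4E2D))  # a 中
```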
However, not all Unicode characters can be directly displayed in HTML for the following reasons:
- Not all Unicode characters are printable; some have no visible form, like the newline character with a code point of 10 in decimal (A in hexadecimal).
- The less-than (<) and greater-than (>) signs are used to define HTML tags. When these symbols are needed elsewhere, they must be prevented from being interpreted as tags.
- Due to the vast number of Unicode characters, it’s impossible to create an input method that allows direct input of all these characters. In other words, no keyboard can input every symbol.
- Webpages cannot mix multiple encodings. If you use UTF-8 encoding and want to insert characters from another encoding, it can be very challenging.
To solve these issues, HTML allows characters to be represented by their Unicode code points, which the browser automatically converts into the corresponding characters.
The syntax for representing characters by their code points is &#N; (decimal, where N is the code point) or &#xN; (hexadecimal, where N is the code point). For example, the character “a” can be written as &#97; (decimal) or &#x61; (hexadecimal), and the character “中” can be written as &#20013; (decimal) or &#x4e2d; (hexadecimal). The browser will automatically convert them.
<p>hello</p>
<!-- Equivalent to -->
<p>&#104;&#101;&#108;&#108;&#111;</p>
<!-- Equivalent to -->
<p>&#x68;&#x65;&#x6c;&#x6c;&#x6f;</p>
In the example above, characters can be represented directly or using their decimal or hexadecimal code points.
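To make the two numeric forms concrete, here is a small Python sketch that writes every character of a string as a decimal or hexadecimal reference and then decodes it again with the standard library's html.unescape. The function name to_numeric_refs is an assumption made for this example.

```python
import html


def to_numeric_refs(text: str, hexadecimal: bool = False) -> str:
    """Write every character of text as &#N; or &#xN;."""
    if hexadecimal:
        return "".join(f"&#x{ord(ch):x};" for ch in text)
    return "".join(f"&#{ord(ch)};" for ch in text)


decimal_form = to_numeric_refs("hello")
hex_form = to_numeric_refs("hello", hexadecimal=True)
print(decimal_form)  # &#104;&#101;&#108;&#108;&#111;
print(hex_form)      # &#x68;&#x65;&#x6c;&#x6c;&#x6f;

# html.unescape() performs the conversion a browser would apply.
print(html.unescape(decimal_form), html.unescape(hex_form))  # hello hello
```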
Note that HTML tags themselves cannot be represented by code points, as the browser will treat them as text content rather than tags. For instance, writing <p> as &#60;p&#62; or &#x3c;p&#x3e; will cause the browser to display <p> as text instead of recognizing it as a tag.
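Python's html.parser.HTMLParser behaves the same way in this respect, so it can stand in for a browser in a quick check. The sketch below collects the start tags and the text a parser sees; the class TagAndTextCollector and the function parse are hypothetical names used only here.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Collect start-tag names and text content separately."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text_parts.append(data)


def parse(markup):
    collector = TagAndTextCollector()
    collector.feed(markup)
    collector.close()  # flush any buffered text
    return collector.tags, "".join(collector.text_parts)


print(parse("<p>hi</p>"))      # (['p'], 'hi')
print(parse("&#60;p&#62;hi"))  # ([], '<p>hi')
```

In the second call no tag is reported; the decoded <p> arrives purely as text.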
Entity Representation of Characters
The numeric representation of characters can be inconvenient because it requires knowing each character’s code point, which can be difficult to remember. To simplify input, HTML allows certain special characters to be represented by easy-to-remember names, known as entities.
Entities are written as &name;, where name is the character’s name. Here are some special characters and their corresponding entities:
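- <: &lt;
- >: &gt;
- ": &quot;
- ': &apos;
- &: &amp;
- ©: &copy;
- #: &num;
- §: &sect;
- ¥: &yen;
- $: &dollar;
- £: &pound;
- ¢: &cent;
- %: &percnt;
- *: &ast;
- @: &commat;
- ^: &Hat;
- ±: &plusmn;
- (space): &nbsp;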
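Note that the last special character is a space, which also has a corresponding entity representation. Strictly speaking, &nbsp; stands for the non-breaking space (code point 160), not the ordinary space (code point 32).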
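Python's standard library ships the same name table, which makes it easy to experiment with entities outside a browser. In the sketch below, html.entities.html5 maps entity names to characters, html.unescape decodes entities in text, and html.escape produces entities for the characters that are significant in markup. The helper lookup is a hypothetical convenience wrapper.

```python
import html
from html.entities import html5


def lookup(name: str) -> str:
    """Return the character for an entity name such as 'lt' or 'commat'."""
    return html5[name + ";"]


print(lookup("lt"), lookup("copy"), lookup("commat"), lookup("Hat"))  # < © @ ^

# &nbsp; is the non-breaking space, code point 160.
print(ord(lookup("nbsp")))  # 160

# Decoding and encoding whole strings.
print(html.unescape("&lt;p&gt; &amp; &plusmn;"))   # <p> & ±
print(html.escape('<a href="x">Tom & Jerry</a>'))
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```

Note that html.escape only rewrites the handful of characters that could be mistaken for markup; everything else is left as typed.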
Both the numeric and entity representations allow characters that cannot be typed directly, or that would otherwise be read as markup, to be displayed on the webpage, thereby “escaping” the limitations of the browser. This is why the process is called “escaping” characters in English.
Link to original article:
https://wangdoc.com/html/encode