import * as codepointIterator from "https://deno.land/x/codepoint_iterator@v1.1.1/constants.ts";
Examples
Imagine encoding the character '𝄞' (the G Clef symbol in music), which requires a 4-byte UTF-8 sequence.
- Identify the lead byte for a 4-byte sequence: LEAD_FOR_4B (1111 0000 in binary).
- The mask for extracting significant bits from the first byte in a 4-byte sequence: MASK_FOR_4B (0000 0111 in binary).
- To encode '𝄞' (U+1D11E), the bits above the ASCII range must be moved into the first byte, which requires a shift of BITS_FOR_4B (18 bits, bringing the highest bits, 19 to 21, into position).
The process involves:
- Using LEAD_FOR_4B to start the encoding sequence.
- Applying MASK_FOR_4B to extract the first few significant bits of the character.
- Shifting by BITS_FOR_4B, BITS_FOR_3B, and BITS_FOR_2B to position the remaining bits correctly.
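The steps above can be sketched as follows. This is a minimal illustration, not the module's own implementation: the constants are redeclared locally so the snippet runs on its own, the value 12 for BITS_FOR_3B is an assumption (the text does not state it), and LEAD_FOR_1B and MASK_FOR_1B are used for the trailing bytes because their stated bit patterns (1000 0000 and 0011 1111) match the form of UTF-8 trailing bytes. The helper name encode4 is hypothetical.

```typescript
// Local stand-ins for the module's constants, so the sketch runs on its own.
const LEAD_FOR_4B = 0b11110000; // lead bits of a 4-byte sequence
const MASK_FOR_4B = 0b00000111; // payload bits kept in the first byte
const LEAD_FOR_1B = 0b10000000; // used here as the marker of each trailing byte
const MASK_FOR_1B = 0b00111111; // six payload bits per trailing byte
const BITS_FOR_4B = 18;
const BITS_FOR_3B = 12; // assumed value; the text above does not state it
const BITS_FOR_2B = 6;

// Hypothetical helper: encode a code point in U+10000..U+10FFFF as four bytes.
function encode4(codePoint: number): number[] {
  return [
    LEAD_FOR_4B | ((codePoint >> BITS_FOR_4B) & MASK_FOR_4B),
    LEAD_FOR_1B | ((codePoint >> BITS_FOR_3B) & MASK_FOR_1B),
    LEAD_FOR_1B | ((codePoint >> BITS_FOR_2B) & MASK_FOR_1B),
    LEAD_FOR_1B | (codePoint & MASK_FOR_1B),
  ];
}

const gClefBytes = encode4(0x1d11e); // '𝄞' is U+1D11E
console.log(gClefBytes.map((b) => b.toString(16)).join(" ")); // f0 9d 84 9e
```

The sketch does no range checking, so passing a code point below U+10000 would produce an overlong (invalid) sequence.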
For a 2-byte character like 'Ω' (Omega):
- Start with LEAD_FOR_2B (1100 0000 in binary) to indicate a 2-byte sequence.
- Use MASK_FOR_2B (0001 1111 in binary) for the first byte's significant bits.
- The shift amount is BITS_FOR_2B (6 bits, which brings bits 7 to 11 into the first byte).
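A small sketch of the 2-byte case, shown in both directions. The constants are redeclared locally with the values quoted above; encode2 and decode2 are hypothetical helper names, and using LEAD_FOR_1B and MASK_FOR_1B for the second byte is an assumption based on their stated bit patterns.

```typescript
// Local stand-ins for the module's constants (values as quoted in the text).
const LEAD_FOR_2B = 0b11000000;
const MASK_FOR_2B = 0b00011111;
const LEAD_FOR_1B = 0b10000000; // assumed to mark the trailing byte
const MASK_FOR_1B = 0b00111111; // assumed to select its six payload bits
const BITS_FOR_2B = 6;

// Hypothetical helper: encode a code point in U+0080..U+07FF as two bytes.
function encode2(codePoint: number): [number, number] {
  return [
    LEAD_FOR_2B | ((codePoint >> BITS_FOR_2B) & MASK_FOR_2B),
    LEAD_FOR_1B | (codePoint & MASK_FOR_1B),
  ];
}

// Hypothetical helper: rebuild the code point from a 2-byte sequence.
function decode2(first: number, second: number): number {
  return ((first & MASK_FOR_2B) << BITS_FOR_2B) | (second & MASK_FOR_1B);
}

const [omegaFirst, omegaSecond] = encode2(0x03a9); // 'Ω' is U+03A9
console.log(omegaFirst.toString(16), omegaSecond.toString(16)); // ce a9
console.log(decode2(omegaFirst, omegaSecond) === 0x03a9); // true
```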
A 1-byte ASCII character, such as 'A':
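- Is stored unchanged as a single byte (0100 0001 in binary) whose top bit is 0, so no lead bits, mask, or shift are needed.
- LEAD_FOR_1B (1000 0000 in binary) and MASK_FOR_1B (0011 1111 in binary) match the pattern UTF-8 uses for the trailing bytes of a multi-byte sequence, each of which carries 6 bits of the character.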
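For comparison with the multi-byte cases, here is a brief sketch of how the byte count follows from the size of the code point. The thresholds are the standard UTF-8 range boundaries; encodeAscii and utf8Length are hypothetical names that do not come from the module.

```typescript
// Hypothetical helper: a code point below U+0080 is its own single UTF-8 byte.
function encodeAscii(codePoint: number): number[] {
  if (codePoint < 0 || codePoint > 0x7f) {
    throw new RangeError("code point is outside the ASCII range");
  }
  return [codePoint];
}

// Hypothetical helper: number of UTF-8 bytes needed for a code point.
function utf8Length(codePoint: number): number {
  if (codePoint < 0x80) return 1; // fits in 7 bits
  if (codePoint < 0x800) return 2; // fits in 11 bits
  if (codePoint < 0x10000) return 3; // fits in 16 bits
  return 4; // up to 21 bits
}

console.log(encodeAscii("A".charCodeAt(0))); // [ 65 ], i.e. 0x41
console.log(utf8Length(0x41), utf8Length(0x03a9), utf8Length(0x1d11e)); // 1 2 4
```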
Variables
Number of significant bits in a 2-byte sequence, used for characters beyond the ASCII range.
Number of significant bits in a 3-byte sequence, typically used for characters in many non-Western alphabets.
Number of significant bits in a 4-byte sequence, used for characters that are less common in daily use.
Leading bits for a 1-byte sequence in UTF-8 encoding. This indicates that the character is represented with a single byte.
Leading bits for a 2-byte sequence, indicating the start of a 2-byte encoded character.
Leading bits for a 3-byte sequence, indicating the start of a 3-byte encoded character.
Leading bits for a 4-byte sequence, indicating the start of a 4-byte encoded character.
Leading bits for a 5-byte sequence. This is not officially used in UTF-8 encoding and is included for completeness.
Mask for extracting the significant bits from a 1-byte encoded character.
Mask for extracting the significant bits from a 2-byte encoded character.
Mask for extracting the significant bits from a 3-byte encoded character.
Mask for extracting the significant bits from a 4-byte encoded character.
The maximum number of bytes required to represent any UTF-8 character. This constant defines the upper limit for UTF-8 encoded character size.
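One way the leading-bit values can work together is to classify the first byte of a sequence: masking with the next-longer lead pattern and comparing against the current one isolates the prefix. The sketch below redeclares the constants locally. The values for LEAD_FOR_3B (1110 0000) and LEAD_FOR_5B (1111 1000) are assumptions inferred from the pattern of the others, and MAX_BYTES, sequenceLength, and worstCaseSize are hypothetical names standing in for the maximum-length constant and for helpers the module may or may not provide.

```typescript
// Local stand-ins; 3B and 5B values are assumed from the pattern of the others.
const LEAD_FOR_1B = 0b10000000;
const LEAD_FOR_2B = 0b11000000;
const LEAD_FOR_3B = 0b11100000; // assumed
const LEAD_FOR_4B = 0b11110000;
const LEAD_FOR_5B = 0b11111000; // assumed
const MAX_BYTES = 4; // hypothetical name for the maximum sequence length

// Hypothetical helper: sequence length announced by a first byte,
// or 0 if the byte cannot start a sequence.
function sequenceLength(firstByte: number): number {
  if ((firstByte & LEAD_FOR_1B) === 0) return 1; // 0xxxxxxx
  if ((firstByte & LEAD_FOR_3B) === LEAD_FOR_2B) return 2; // 110xxxxx
  if ((firstByte & LEAD_FOR_4B) === LEAD_FOR_3B) return 3; // 1110xxxx
  if ((firstByte & LEAD_FOR_5B) === LEAD_FOR_4B) return 4; // 11110xxx
  return 0; // trailing byte (10xxxxxx) or a pattern longer than MAX_BYTES
}

// Hypothetical helper: worst-case buffer size for a number of code points.
function worstCaseSize(codePointCount: number): number {
  return codePointCount * MAX_BYTES;
}

console.log([0x41, 0xce, 0xe2, 0xf0, 0x9d].map(sequenceLength)); // [ 1, 2, 3, 4, 0 ]
console.log(worstCaseSize(3)); // 12
```

The classifier only inspects prefix bits; it does not reject overlong or otherwise invalid lead bytes such as 0xC0.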