/constants.ts | codepoint_iterator@v1.1.1

import * as codepointIterator from "https://deno.land/x/codepoint_iterator@v1.1.1/constants.ts";

Examples

Imagine encoding the character '𝄞' (the G Clef symbol in music), which requires a 4-byte UTF-8 sequence.

Identify the lead byte for a 4-byte sequence: LEAD_FOR_4B (1111 0000 in binary)
The mask for extracting significant bits from the first byte in a 4-byte sequence: MASK_FOR_4B (0000 0111 in binary)
To encode '𝄞' (U+1D11E), shift the code point right by BITS_FOR_4B (18) so that its highest bits, bits 19 to 21, land in the lead byte.

The process involves:

Using LEAD_FOR_4B to start the encoding sequence.
Applying MASK_FOR_4B to extract the first few significant bits of the character.
Shifting by BITS_FOR_4B, BITS_FOR_3B, and BITS_FOR_2B to position the remaining bits correctly.
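
The steps above can be sketched as follows. This is a minimal illustration, not the module's own code: the constants are local stand-ins whose names and values follow the standard UTF-8 bit layout, and the function name encode4 is hypothetical. The real exports of constants.ts may differ.

```typescript
// Local stand-ins (assumptions): standard UTF-8 bit patterns, not values
// copied from constants.ts.
const LEAD_4B = 0xf0; // 1111 0000, lead byte marker for a 4-byte sequence
const MASK_4B = 0x07; // 0000 0111, payload bits held by that lead byte
const CONT_LEAD = 0x80; // 1000 0000, marker of every continuation byte
const CONT_MASK = 0x3f; // 0011 1111, six payload bits per continuation byte
const SHIFT_4B = 18; // bits below the lead byte's payload
const SHIFT_3B = 12;
const SHIFT_2B = 6;

// Encode a code point in the range U+10000..U+10FFFF as four UTF-8 bytes.
function encode4(codePoint: number): number[] {
  return [
    LEAD_4B | ((codePoint >> SHIFT_4B) & MASK_4B),
    CONT_LEAD | ((codePoint >> SHIFT_3B) & CONT_MASK),
    CONT_LEAD | ((codePoint >> SHIFT_2B) & CONT_MASK),
    CONT_LEAD | (codePoint & CONT_MASK),
  ];
}

// U+1D11E is '𝄞'.
const clef = encode4(0x1d11e);
console.log(clef.map((b) => b.toString(16)).join(" ")); // f0 9d 84 9e
```

The function does no range checking; a caller would need to confirm that the code point really requires four bytes before using it.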

For a 2-byte character like 'Ω' (Omega):
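
For 'Ω' (U+03A9) only one shift is needed. The sketch below assumes the standard UTF-8 layout for 2-byte sequences (lead marker 1100 0000 with five payload bits, followed by one continuation byte). The constant and function names are illustrative, and the values are not copied from constants.ts.

```typescript
// Local stand-ins (assumptions): standard UTF-8 bit patterns.
const LEAD_2B = 0xc0; // 1100 0000, lead byte marker for a 2-byte sequence
const MASK_2B = 0x1f; // 0001 1111, payload bits held by that lead byte
const CONT_LEAD_2 = 0x80; // 1000 0000, continuation byte marker
const CONT_MASK_2 = 0x3f; // 0011 1111, six payload bits
const SHIFT_2 = 6; // bits carried by the single continuation byte

// Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.
function encode2(codePoint: number): number[] {
  return [
    LEAD_2B | ((codePoint >> SHIFT_2) & MASK_2B),
    CONT_LEAD_2 | (codePoint & CONT_MASK_2),
  ];
}

// U+03A9 is 'Ω'.
const omega = encode2(0x03a9);
console.log(omega.map((b) => b.toString(16)).join(" ")); // ce a9
```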

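A 1-byte ASCII character, such as 'A':

Needs no lead bits or shifting: 'A' (U+0041) is stored as the single byte 0100 0001, with the high bit clear. The patterns given for LEAD_FOR_1B (1000 0000 in binary) and MASK_FOR_1B (0011 1111 in binary) are those of a UTF-8 continuation byte, so they describe the trailing bytes of multi-byte sequences rather than ASCII.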

BITS_FOR_2B	Number of significant bits in a 2-byte sequence, used for characters beyond the ASCII range (U+0080 to U+07FF).
BITS_FOR_3B	Number of significant bits in a 3-byte sequence, used for code points U+0800 to U+FFFF, which include most CJK and Indic characters.
BITS_FOR_4B	Number of significant bits in a 4-byte sequence, used for code points U+10000 to U+10FFFF, such as emoji and musical symbols.
LEAD_FOR_1B	Leading bits for a 1-byte sequence in UTF-8 encoding. This indicates that the character is represented with a single byte.
LEAD_FOR_2B	Leading bits for a 2-byte sequence, indicating the start of a 2-byte encoded character.
LEAD_FOR_3B	Leading bits for a 3-byte sequence, indicating the start of a 3-byte encoded character.
LEAD_FOR_4B	Leading bits for a 4-byte sequence, indicating the start of a 4-byte encoded character.
LEAD_FOR_5B	Leading bits for a 5-byte sequence. Such sequences are not valid UTF-8, which stops at 4 bytes; the constant is included for completeness.
MASK_FOR_1B	Mask for extracting the significant bits from a 1-byte encoded character.
MASK_FOR_2B	Mask for extracting the significant bits from a 2-byte encoded character.
MASK_FOR_3B	Mask for extracting the significant bits from a 3-byte encoded character.
MASK_FOR_4B	Mask for extracting the significant bits from a 4-byte encoded character.
UTF8_MAX_BYTE_LENGTH	The maximum number of bytes needed to encode a single character in UTF-8.
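
Going the other way, lead patterns and masks of this kind can be combined to decode a sequence. The sketch below is an illustration under the standard UTF-8 layout, not the module's implementation: sequenceLength, decodeFirst, and the local tables are hypothetical names, and the values are not read from constants.ts.

```typescript
// Payload masks for the lead byte of a 1-, 2-, 3- and 4-byte sequence
// (assumed standard UTF-8 values).
const LEAD_MASKS = [0x7f, 0x1f, 0x0f, 0x07];
const CONT_PAYLOAD_MASK = 0x3f; // six payload bits per continuation byte
const CONT_BITS = 6;

// Determine the sequence length from the lead byte's high bits.
function sequenceLength(lead: number): number {
  if ((lead & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((lead & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((lead & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((lead & 0xf8) === 0xf0) return 4; // 11110xxx
  throw new RangeError("not a valid UTF-8 lead byte: " + lead);
}

// Decode the first code point in `bytes`. Continuation bytes are not
// validated here; a production decoder would check their 10xxxxxx marker.
function decodeFirst(bytes: number[]): number {
  if (bytes.length === 0) throw new RangeError("empty input");
  const length = sequenceLength(bytes[0]);
  if (bytes.length < length) throw new RangeError("truncated sequence");
  let codePoint = bytes[0] & LEAD_MASKS[length - 1];
  for (let i = 1; i < length; i++) {
    codePoint = (codePoint << CONT_BITS) | (bytes[i] & CONT_PAYLOAD_MASK);
  }
  return codePoint;
}

console.log(decodeFirst([0xf0, 0x9d, 0x84, 0x9e]).toString(16)); // 1d11e
console.log(decodeFirst([0xce, 0xa9]).toString(16)); // 3a9
console.log(decodeFirst([0x41]).toString(16)); // 41
```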