Module

x/codepoint_iterator/constants.ts

Fast Uint8Array to UTF-8 code point iterator for streams and array buffers by @okikio & @jonathantneal
import * as codepointIterator from "https://deno.land/x/codepoint_iterator@v1.1.1/constants.ts";

Examples

Imagine encoding the character '𝄞' (the G Clef symbol in music), which requires a 4-byte UTF-8 sequence.

  1. Identify the lead byte for a 4-byte sequence: LEAD_FOR_4B (1111 0000 in binary)
  2. The mask for extracting significant bits from the first byte in a 4-byte sequence: MASK_FOR_4B (0000 0111 in binary)
  3. '𝄞' is code point U+1D11E. Shifting the code point right by BITS_FOR_4B (18) isolates its highest bits (bits 19 -> 21), which are the bits carried by the first byte.

The process involves:

  • Using LEAD_FOR_4B to start the encoding sequence.
  • Applying MASK_FOR_4B to extract the first few significant bits of the character.
  • Shifting by BITS_FOR_4B, BITS_FOR_3B, and BITS_FOR_2B to position the remaining bits correctly.
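The steps above can be sketched in a few lines. This is a minimal illustration, not the module's own code: the constants are redeclared locally so the snippet stands alone, BITS_FOR_3B = 12 is an assumed value, and CONTINUATION_LEAD / CONTINUATION_MASK are hypothetical names for the standard UTF-8 continuation-byte patterns (10xxxxxx marker and six-bit payload mask).

```typescript
// Local stand-ins for the module's constants (values as listed on this page
// unless marked as assumed).
const LEAD_FOR_4B = 0xf0; // 1111 0000
const MASK_FOR_4B = 0x07; // 0000 0111
const BITS_FOR_4B = 18;
const BITS_FOR_3B = 12; // assumed value
const BITS_FOR_2B = 6;

// Hypothetical names: marker and payload mask of a UTF-8 continuation byte.
const CONTINUATION_LEAD = 0x80; // 1000 0000
const CONTINUATION_MASK = 0x3f; // 0011 1111

// Encode a code point in the range U+10000..U+10FFFF as four UTF-8 bytes.
function encode4Bytes(codePoint: number): number[] {
  return [
    LEAD_FOR_4B | ((codePoint >> BITS_FOR_4B) & MASK_FOR_4B),
    CONTINUATION_LEAD | ((codePoint >> BITS_FOR_3B) & CONTINUATION_MASK),
    CONTINUATION_LEAD | ((codePoint >> BITS_FOR_2B) & CONTINUATION_MASK),
    CONTINUATION_LEAD | (codePoint & CONTINUATION_MASK),
  ];
}

// '𝄞' is U+1D11E; its UTF-8 bytes are f0 9d 84 9e.
const gClef: number[] = encode4Bytes(0x1d11e);
```

The sketch does no range checking; a caller would need to confirm the code point really requires four bytes before using it.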

For a 2-byte character like 'Ω' (Omega):

  • Start with LEAD_FOR_2B (1100 0000 in binary) to indicate a 2-byte sequence.
  • Use MASK_FOR_2B (0001 1111 in binary) for the first byte's significant bits.
  • The shift amount is BITS_FOR_2B (6): shifting the code point right by 6 moves bits 7 to 11 into the first byte, and the low 6 bits go into the second byte.
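As a rough sketch of the 2-byte case in both directions, the snippet below encodes 'Ω' (U+03A9) and decodes it again. The constants are redeclared locally with the values listed above; CONTINUATION_LEAD and CONTINUATION_MASK are hypothetical names for the standard UTF-8 continuation-byte patterns, not exports confirmed by this page.

```typescript
// Local stand-ins for the module's constants.
const LEAD_FOR_2B = 0xc0; // 1100 0000
const MASK_FOR_2B = 0x1f; // 0001 1111
const BITS_FOR_2B = 6;

// Hypothetical names: marker and payload mask of a UTF-8 continuation byte.
const CONTINUATION_LEAD = 0x80; // 1000 0000
const CONTINUATION_MASK = 0x3f; // 0011 1111

// Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.
function encode2Bytes(codePoint: number): [number, number] {
  return [
    LEAD_FOR_2B | ((codePoint >> BITS_FOR_2B) & MASK_FOR_2B),
    CONTINUATION_LEAD | (codePoint & CONTINUATION_MASK),
  ];
}

// Reverse the process: mask off the marker bits and recombine the payloads.
function decode2Bytes(first: number, second: number): number {
  return ((first & MASK_FOR_2B) << BITS_FOR_2B) | (second & CONTINUATION_MASK);
}

const omegaBytes = encode2Bytes(0x03a9); // [0xce, 0xa9]
const omegaCodePoint = decode2Bytes(omegaBytes[0], omegaBytes[1]); // 0x03a9
```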

A 1-byte ASCII character, such as 'A':

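  • Is stored as a single byte with the high bit clear (0100 0001 for 'A'), so its value is always below LEAD_FOR_1B (1000 0000 in binary). The bit patterns listed for LEAD_FOR_1B and MASK_FOR_1B (0011 1111 in binary) match the marker and the six payload bits of a UTF-8 continuation byte (10xx xxxx).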

Variables

Number of significant bits in a 2-byte sequence, used for characters beyond the ASCII range.

Number of significant bits in a 3-byte sequence, used for code points from U+0800 to U+FFFF, which covers most CJK characters and many other non-Latin scripts.

Number of significant bits in a 4-byte sequence, used for code points above U+FFFF, such as emoji and musical symbols like '𝄞'.

Leading bits for a 1-byte sequence in UTF-8 encoding. This indicates that the character is represented with a single byte.

Leading bits for a 2-byte sequence, indicating the start of a 2-byte encoded character.

Leading bits for a 3-byte sequence, indicating the start of a 3-byte encoded character.

Leading bits for a 4-byte sequence, indicating the start of a 4-byte encoded character.

Leading bits for a 5-byte sequence. Standard UTF-8 (RFC 3629) stops at 4 bytes, so this pattern never starts a valid sequence; it is included for completeness.
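One way the leading-bit constants can work together is to classify a byte by comparing it against each threshold in turn. The sketch below is an illustration under assumptions, not the module's implementation: the constants are redeclared locally, and the names and values of LEAD_FOR_3B (1110 0000) and LEAD_FOR_5B (1111 1000) are assumed from the naming pattern and the UTF-8 bit layout.

```typescript
// Local stand-ins for the leading-bit constants.
const LEAD_FOR_1B = 0x80; // 1000 0000, as listed in the examples
const LEAD_FOR_2B = 0xc0; // 1100 0000, as listed in the examples
const LEAD_FOR_3B = 0xe0; // 1110 0000, assumed name and value
const LEAD_FOR_4B = 0xf0; // 1111 0000, as listed in the examples
const LEAD_FOR_5B = 0xf8; // 1111 1000, assumed name and value

// Return the length of the sequence a byte starts, or 0 if the byte cannot
// start one (a continuation byte or a 5-byte-style lead).
function sequenceLength(byte: number): number {
  if (byte < LEAD_FOR_1B) return 1; // 0xxx xxxx: ASCII
  if (byte < LEAD_FOR_2B) return 0; // 10xx xxxx: continuation byte
  if (byte < LEAD_FOR_3B) return 2; // 110x xxxx
  if (byte < LEAD_FOR_4B) return 3; // 1110 xxxx
  if (byte < LEAD_FOR_5B) return 4; // 1111 0xxx
  return 0; // 1111 1xxx: not used by UTF-8
}
```

This only inspects bit patterns. A strict decoder would also reject overlong leads (0xC0, 0xC1) and leads above 0xF4, which this sketch does not do.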

Mask for extracting the significant bits from a 1-byte encoded character.

Mask for extracting the significant bits from a 2-byte encoded character.

Mask for extracting the significant bits from a 3-byte encoded character.

Mask for extracting the significant bits from a 4-byte encoded character.

The maximum number of bytes required to represent any UTF-8 character. This constant defines the upper limit for UTF-8 encoded character size.
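A typical use for such an upper limit is sizing an output buffer before encoding. The following is a hedged sketch: MAX_UTF8_BYTES is a hypothetical local name (the module's actual export name is not shown on this page), and its value 4 is the limit set by the UTF-8 standard.

```typescript
// Hypothetical local name; 4 is the standard UTF-8 maximum per code point.
const MAX_UTF8_BYTES = 4;

// Number of UTF-8 bytes needed for one code point (U+0000..U+10FFFF).
function utf8ByteLength(codePoint: number): number {
  if (codePoint < 0x80) return 1; // ASCII
  if (codePoint < 0x800) return 2; // up to 11 bits
  if (codePoint < 0x10000) return 3; // up to 16 bits
  return MAX_UTF8_BYTES; // up to 21 bits
}

// Worst-case buffer size for a given number of code points.
function worstCaseBufferSize(codePointCount: number): number {
  return codePointCount * MAX_UTF8_BYTES;
}

// 'A', 'Ω', '€', '𝄞' need 1 + 2 + 3 + 4 = 10 bytes; the worst case is 16.
const needed = [0x41, 0x03a9, 0x20ac, 0x1d11e]
  .map(utf8ByteLength)
  .reduce((sum, n) => sum + n, 0);
const reserved = worstCaseBufferSize(4);
```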