x/codepoint_iterator/mod.ts

Fast Uint8Array to UTF-8 code point iterator for streams and array buffers, by @okikio & @jonathantneal
import * as codepointIterator from "https://deno.land/x/codepoint_iterator@v1.1.1/mod.ts";

Examples

Imagine encoding the character '𝄞' (U+1D11E, the G clef symbol in music), which requires a 4-byte UTF-8 sequence.

  1. Identify the lead byte for a 4-byte sequence: LEAD_FOR_4B (1111 0000 in binary).
  2. Take the mask for extracting the significant bits from the first byte of a 4-byte sequence: MASK_FOR_4B (0000 0111 in binary).
  3. Shift by BITS_FOR_4B (18 bits) so that the three bits carried by the first byte land at the top of the code point, in bit positions 19 to 21.

The process involves:

  • Using LEAD_FOR_4B to start the encoding sequence.
  • Applying MASK_FOR_4B to extract the first few significant bits of the character.
  • Shifting by BITS_FOR_4B, BITS_FOR_3B, and BITS_FOR_2B to position the remaining bits correctly.
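
The steps above can be sketched as follows. This is a minimal illustration, not the module's own code: the constants are redeclared locally, LEAD_FOR_4B, MASK_FOR_4B, and BITS_FOR_4B use the values quoted in the text, while the values for BITS_FOR_3B and BITS_FOR_2B and the names CONTINUATION_LEAD, CONTINUATION_MASK, and encode4Bytes are assumptions based on standard UTF-8.

```typescript
// Local stand-ins for the constants named above (not imported from the module).
const LEAD_FOR_4B = 0b1111_0000;
const MASK_FOR_4B = 0b0000_0111;
const BITS_FOR_4B = 18;
const BITS_FOR_3B = 12; // assumed value, standard UTF-8 layout
const BITS_FOR_2B = 6;
const CONTINUATION_LEAD = 0b1000_0000; // hypothetical name: 10xx xxxx marker
const CONTINUATION_MASK = 0b0011_1111; // hypothetical name: six payload bits

// Encode a code point in the range U+10000..U+10FFFF as four UTF-8 bytes.
function encode4Bytes(codePoint: number): number[] {
  return [
    // Lead byte: 1111 0xxx, carrying the top three bits.
    LEAD_FOR_4B | ((codePoint >> BITS_FOR_4B) & MASK_FOR_4B),
    // Three continuation bytes, six bits each.
    CONTINUATION_LEAD | ((codePoint >> BITS_FOR_3B) & CONTINUATION_MASK),
    CONTINUATION_LEAD | ((codePoint >> BITS_FOR_2B) & CONTINUATION_MASK),
    CONTINUATION_LEAD | (codePoint & CONTINUATION_MASK),
  ];
}

// U+1D11E (the G clef) encodes as the bytes F0 9D 84 9E.
const gClefBytes = encode4Bytes(0x1d11e);
```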

For a 2-byte character like 'Ω' (Omega):

  • Start with LEAD_FOR_2B (1100 0000 in binary) to indicate a 2-byte sequence.
  • Use MASK_FOR_2B (0001 1111 in binary) for the first byte's significant bits.
  • The shift amount is BITS_FOR_2B (6 bits), which places the first byte's five significant bits at positions 7 to 11.
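
As a rough sketch of the reverse direction, the two bytes of 'Ω' (CE A9) can be combined back into its code point using the mask and shift described above. The constants are redeclared locally, and CONTINUATION_MASK and decode2Bytes are hypothetical names that are not taken from the module.

```typescript
// Local stand-ins using the values quoted in the text.
const MASK_FOR_2B = 0b0001_1111;
const BITS_FOR_2B = 6;
const CONTINUATION_MASK = 0b0011_1111; // hypothetical name: six payload bits

// Combine a 2-byte UTF-8 sequence into a code point (no validation).
function decode2Bytes(first: number, second: number): number {
  const high = (first & MASK_FOR_2B) << BITS_FOR_2B; // five bits from the lead byte
  const low = second & CONTINUATION_MASK; // six bits from the continuation byte
  return high | low;
}

// 'Ω' is stored as CE A9 and decodes to U+03A9.
const omega = decode2Bytes(0xce, 0xa9);
```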

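A 1-byte ASCII character, such as 'A' (U+0041):

  • Is stored directly as the single byte 0100 0001. The high bit is 0, so no lead bits, mask, or shift are needed.
  • The values listed for LEAD_FOR_1B (1000 0000 in binary) and MASK_FOR_1B (0011 1111 in binary) match the 10xx xxxx pattern of a UTF-8 continuation byte: the marker bits and the six-bit payload mask of the bytes that follow a lead byte.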

Variables

Number of significant bits in a 2-byte sequence, used for characters beyond the ASCII range.

Number of significant bits in a 3-byte sequence, typically used for characters in many non-Western alphabets.

Number of significant bits in a 4-byte sequence, used for characters that are less common in daily use.

Leading bits for a 1-byte sequence in UTF-8 encoding. This indicates that the character is represented with a single byte.

Leading bits for a 2-byte sequence, indicating the start of a 2-byte encoded character.

Leading bits for a 3-byte sequence, indicating the start of a 3-byte encoded character.

Leading bits for a 4-byte sequence, indicating the start of a 4-byte encoded character.

Leading bits for a 5-byte sequence. This is not officially used in UTF-8 encoding and is included for completeness.

Mask for extracting the significant bits from a 1-byte encoded character.

Mask for extracting the significant bits from a 2-byte encoded character.

Mask for extracting the significant bits from a 3-byte encoded character.

Mask for extracting the significant bits from a 4-byte encoded character.

The maximum number of bytes required to represent any UTF-8 character. This constant defines the upper limit for UTF-8 encoded character size.

Functions

Converts an iterable of Uint8Array (byte arrays) into an array of Unicode code points. This is particularly useful for processing streams of text data, where each chunk is represented as a Uint8Array, and you want to work with the text's Unicode code points.

Processes an iterable or async iterable of Uint8Array chunks and invokes a callback for each code point. The function performs the following steps:

  • Iterate through the input iterable, which yields chunks of bytes (Uint8Array).
  • Process each chunk using a TextDecoder to extract UTF-8 characters.
  • Calculate the corresponding Unicode code points for the extracted characters.
  • Invoke the provided callback for each code point.
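
The steps listed above might look roughly like this for a synchronous iterable. The name forEachCodePoint is hypothetical and this is not the module's implementation; it only illustrates the TextDecoder approach, assuming a runtime that provides the standard TextDecoder global (Deno, browsers, recent Node.js).

```typescript
// Hypothetical helper illustrating the listed steps; not the module's code.
function forEachCodePoint(
  chunks: Iterable<Uint8Array>,
  callback: (codePoint: number) => void,
): void {
  const decoder = new TextDecoder("utf-8");
  const emit = (text: string): void => {
    // Iterating a string yields whole code points, not UTF-16 code units.
    for (const char of text) callback(char.codePointAt(0)!);
  };
  for (const chunk of chunks) {
    // stream: true keeps an incomplete trailing sequence buffered
    // until the next chunk arrives.
    emit(decoder.decode(chunk, { stream: true }));
  }
  // Flush anything still buffered at the end of the input.
  emit(decoder.decode());
}

// 'Ω' (CE A9) is deliberately split across two chunks.
const seen: number[] = [];
forEachCodePoint(
  [new Uint8Array([0x41, 0xce]), new Uint8Array([0xa9])],
  (codePoint) => {
    seen.push(codePoint);
  },
);
```

An async iterable could be handled the same way by making the function async and using for await over the chunks.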

Converts an iterable of UTF-8 encoded Uint8Array chunks into an async generator of Unicode code points.

Converts a sequence of bytes into a Unicode code point. This function is a key part of decoding UTF-8 encoded text, as it translates the raw bytes back into the characters they represent.

Extracts a Unicode code point from a given buffer starting at a specified index. This method is useful for parsing a stream or array of data where UTF-8 characters are embedded within a larger set of binary data.
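
One way to picture this kind of extraction is the sketch below. The function name codePointFromBuffer is hypothetical, the masks and shifts are standard UTF-8 values rather than the module's exports, and the code assumes well-formed input and performs no validation.

```typescript
// Hypothetical helper: decode the UTF-8 sequence that starts at `index`.
// Assumes `index` points at a lead byte of well-formed UTF-8.
function codePointFromBuffer(buffer: Uint8Array, index: number): number {
  const first = buffer[index];
  if (first < 0b1000_0000) return first; // 0xxx xxxx: single ASCII byte

  // Payload (low six bits) of the continuation byte at the given offset.
  const cont = (offset: number): number => buffer[index + offset] & 0b0011_1111;

  if (first < 0b1110_0000) {
    // 110x xxxx: 2-byte sequence
    return ((first & 0b0001_1111) << 6) | cont(1);
  }
  if (first < 0b1111_0000) {
    // 1110 xxxx: 3-byte sequence
    return ((first & 0b0000_1111) << 12) | (cont(1) << 6) | cont(2);
  }
  // 1111 0xxx: 4-byte sequence
  return ((first & 0b0000_0111) << 18) | (cont(1) << 12) | (cont(2) << 6) | cont(3);
}

// A G clef (F0 9D 84 9E) embedded between unrelated bytes.
const mixed = new Uint8Array([0x01, 0x02, 0xf0, 0x9d, 0x84, 0x9e, 0x03]);
const embedded = codePointFromBuffer(mixed, 2);
```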

Extracts the Unicode code point and its size in UTF-16 code units from a string at a given position.
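
For illustration, the same idea can be approximated with the built-in String.prototype.codePointAt. The helper name codePointAndSize and its return shape are assumptions and may differ from the module's function.

```typescript
// Hypothetical helper; the module's own signature may differ.
function codePointAndSize(
  text: string,
  position: number,
): { codePoint: number; size: number } {
  const codePoint = text.codePointAt(position);
  if (codePoint === undefined) {
    throw new RangeError("position " + position + " is outside the string");
  }
  // Code points above U+FFFF occupy a surrogate pair: two UTF-16 code units.
  return { codePoint, size: codePoint > 0xffff ? 2 : 1 };
}

// "a" followed by the G clef (U+1D11E), which needs a surrogate pair.
const clef = codePointAndSize("a\u{1D11E}", 1);
const letter = codePointAndSize("a\u{1D11E}", 0);
```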

Calculates the number of bytes required to represent a single UTF-8 character.
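
The description does not say whether the input is a code point or a lead byte. The sketch below assumes a code point and uses the standard UTF-8 range boundaries; the name utf8ByteLength is hypothetical.

```typescript
// Hypothetical helper: bytes needed to encode one code point in UTF-8.
function utf8ByteLength(codePoint: number): number {
  if (codePoint < 0x80) return 1; // U+0000..U+007F (ASCII)
  if (codePoint < 0x800) return 2; // U+0080..U+07FF
  if (codePoint < 0x10000) return 3; // U+0800..U+FFFF
  return 4; // U+10000..U+10FFFF
}

const lengths = [0x41, 0x3a9, 0x20ac, 0x1d11e].map(utf8ByteLength);
```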

Converts a ReadableStream into an async iterable. This allows for easier consumption of stream data using asynchronous iteration, providing a more modern approach to handling streamed data.
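
A conversion of this kind is commonly written as an async generator around the stream's reader. The sketch below shows that general pattern under the assumption of a runtime with the standard ReadableStream global; streamToAsyncIterable is a hypothetical name, not the module's export.

```typescript
// Hypothetical helper showing the usual reader-based pattern.
async function* streamToAsyncIterable<T>(
  stream: ReadableStream<T>,
): AsyncGenerator<T> {
  const reader = stream.getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) return;
      yield value;
    }
  } finally {
    // Release the lock even if the consumer stops iterating early.
    reader.releaseLock();
  }
}
```

A consumer can then read chunks with `for await (const chunk of streamToAsyncIterable(stream))`.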