/mod.ts | codepoint_iterator@v1.1.1

import * as codepointIterator from "https://deno.land/x/codepoint_iterator@v1.1.1/mod.ts";

Examples

Imagine encoding the character '𝄞' (the G Clef symbol in music), which requires a 4-byte UTF-8 sequence.

Identify the lead byte for a 4-byte sequence: LEAD_FOR_4B (1111 0000 in binary)
The mask for extracting significant bits from the first byte in a 4-byte sequence: MASK_FOR_4B (0000 0111 in binary)
To encode '𝄞' (U+1D11E), the first byte's payload is shifted by BITS_FOR_4B (18 bits), so that it carries the highest bits of the code point, bits 19 to 21.

The process involves:

Using LEAD_FOR_4B to start the encoding sequence.
Applying MASK_FOR_4B to extract the first few significant bits of the character.
Shifting by BITS_FOR_4B, BITS_FOR_3B, and BITS_FOR_2B to position the remaining bits correctly.
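The steps above can be sketched as follows. This is a minimal illustration, not the module's implementation: the constants are redefined locally with the values quoted in this section (the value 12 for BITS_FOR_3B is an assumption based on standard UTF-8), and CONTINUATION, CONTINUATION_MASK, and encode4Byte are hypothetical names introduced only for this example.

```typescript
// Local stand-ins for the constants described above. They are defined here
// rather than imported, so nothing is assumed about the module's exports.
const LEAD_FOR_4B = 0b1111_0000; // 11110xxx marks the start of a 4-byte sequence
const MASK_FOR_4B = 0b0000_0111; // 3 payload bits in the first byte
const BITS_FOR_4B = 18; // shift for the first byte's payload
const BITS_FOR_3B = 12; // assumed value: shift for the second byte's payload
const BITS_FOR_2B = 6; // shift for the third byte's payload

// Hypothetical helper names for the continuation-byte pattern 10xxxxxx.
const CONTINUATION = 0b1000_0000;
const CONTINUATION_MASK = 0b0011_1111;

// Encode a code point in the range U+10000..U+10FFFF as four UTF-8 bytes.
function encode4Byte(codePoint: number): Uint8Array {
  return Uint8Array.of(
    LEAD_FOR_4B | ((codePoint >> BITS_FOR_4B) & MASK_FOR_4B),
    CONTINUATION | ((codePoint >> BITS_FOR_3B) & CONTINUATION_MASK),
    CONTINUATION | ((codePoint >> BITS_FOR_2B) & CONTINUATION_MASK),
    CONTINUATION | (codePoint & CONTINUATION_MASK),
  );
}

const gClef = encode4Byte(0x1d11e); // '𝄞' is U+1D11E
console.log(Array.from(gClef, (b) => b.toString(16))); // [ "f0", "9d", "84", "9e" ]
```

The result can be compared against `new TextEncoder().encode("𝄞")`, which yields the same four bytes.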

For a 2-byte character like 'Ω' (Omega):

Start with LEAD_FOR_2B (1100 0000 in binary) to indicate a 2-byte sequence.
Use MASK_FOR_2B (0001 1111 in binary) for the first byte's significant bits.
The shift amount is BITS_FOR_2B (6 bits), which places the first byte's 5 payload bits at positions 7 to 11 of the code point.
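As a rough sketch of the 2-byte case, the round trip below encodes 'Ω' (U+03A9) and decodes it again. The constants are local copies of the values quoted above, and CONTINUATION, CONTINUATION_MASK, encode2Byte, and decode2Byte are hypothetical names used only for illustration; the module's own functions may work differently.

```typescript
// Local copies of the values described above (not imported from the module).
const LEAD_FOR_2B = 0b1100_0000; // 110xxxxx marks the start of a 2-byte sequence
const MASK_FOR_2B = 0b0001_1111; // 5 payload bits in the first byte
const BITS_FOR_2B = 6; // the second byte carries the low 6 bits

// Hypothetical names for the continuation-byte pattern 10xxxxxx.
const CONTINUATION = 0b1000_0000;
const CONTINUATION_MASK = 0b0011_1111;

// Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.
function encode2Byte(codePoint: number): Uint8Array {
  return Uint8Array.of(
    LEAD_FOR_2B | ((codePoint >> BITS_FOR_2B) & MASK_FOR_2B),
    CONTINUATION | (codePoint & CONTINUATION_MASK),
  );
}

// Reverse the process: mask off the markers and recombine the payload bits.
function decode2Byte(first: number, second: number): number {
  return ((first & MASK_FOR_2B) << BITS_FOR_2B) | (second & CONTINUATION_MASK);
}

const omega = encode2Byte(0x3a9); // bytes 0xCE 0xA9
const roundTrip = decode2Byte(omega[0], omega[1]); // 0x3A9
console.log(String.fromCodePoint(roundTrip)); // Ω
```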

A 1-byte ASCII character, such as 'A':

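Is stored as the single byte 0100 0001: the high bit is 0 and the remaining 7 bits hold the code point directly, so no lead marker or shift is involved. The patterns listed for LEAD_FOR_1B (1000 0000 in binary) and MASK_FOR_1B (0011 1111 in binary) match the 10xxxxxx marker and 6-bit payload mask of the continuation bytes that follow the first byte of a multi-byte sequence.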
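A quick check with the standard TextEncoder (a runtime built-in, not part of this module) shows how 'A' is stored in practice and where the 1000 0000 / 0011 1111 patterns show up. The variable names are illustrative only.

```typescript
const encoder = new TextEncoder();

// 'A' (U+0041) occupies a single byte whose high bit is clear.
const asciiBytes = encoder.encode("A");
const isSingleByte = asciiBytes.length === 1;
const highBitClear = (asciiBytes[0] & 0b1000_0000) === 0;

// In a multi-byte sequence, every byte after the first has the form 10xxxxxx.
const omegaBytes = encoder.encode("Ω"); // 0xCE 0xA9
const secondIsContinuation = (omegaBytes[1] & 0b1100_0000) === 0b1000_0000;
const secondPayload = omegaBytes[1] & 0b0011_1111; // low 6 bits of U+03A9

console.log(isSingleByte, highBitClear, secondIsContinuation);
```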

Variables

v BITS_FOR_2B	Number of significant bits in a 2-byte sequence, used for characters beyond the ASCII range.
v BITS_FOR_3B	Number of significant bits in a 3-byte sequence, typically used for characters in many non-Western alphabets.
v BITS_FOR_4B	Number of significant bits in a 4-byte sequence, used for code points above U+FFFF, such as emoji, musical symbols, and historic scripts.
v LEAD_FOR_1B	Leading bits for a 1-byte sequence in UTF-8 encoding. This indicates that the character is represented with a single byte.
v LEAD_FOR_2B	Leading bits for a 2-byte sequence, indicating the start of a 2-byte encoded character.
v LEAD_FOR_3B	Leading bits for a 3-byte sequence, indicating the start of a 3-byte encoded character.
v LEAD_FOR_4B	Leading bits for a 4-byte sequence, indicating the start of a 4-byte encoded character.
v LEAD_FOR_5B	Leading bits for a 5-byte sequence. This is not officially used in UTF-8 encoding and is included for completeness.
v MASK_FOR_1B	Mask for extracting the significant bits from a 1-byte encoded character.
v MASK_FOR_2B	Mask for extracting the significant bits from a 2-byte encoded character.
v MASK_FOR_3B	Mask for extracting the significant bits from a 3-byte encoded character.
v MASK_FOR_4B	Mask for extracting the significant bits from a 4-byte encoded character.
v UTF8_MAX_BYTE_LENGTH	The maximum number of bytes needed to encode a single Unicode code point in UTF-8.
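To show how lead patterns like the ones listed above are typically used, here is a minimal sketch that classifies a first byte by its leading bits. It follows standard UTF-8, in which a sequence is at most 4 bytes long; sequenceLength is a hypothetical helper written for this example and is not one of the module's exports.

```typescript
// Return the length of the UTF-8 sequence that starts with `firstByte`,
// or 0 if the byte cannot start a sequence in standard UTF-8.
function sequenceLength(firstByte: number): number {
  if ((firstByte & 0b1000_0000) === 0b0000_0000) return 1; // 0xxxxxxx (ASCII)
  if ((firstByte & 0b1110_0000) === 0b1100_0000) return 2; // 110xxxxx
  if ((firstByte & 0b1111_0000) === 0b1110_0000) return 3; // 1110xxxx
  if ((firstByte & 0b1111_1000) === 0b1111_0000) return 4; // 11110xxx
  return 0; // 10xxxxxx continuation byte, or a 5-byte style lead (111110xx)
}

console.log(sequenceLength(0x41)); // 1 ('A')
console.log(sequenceLength(0xce)); // 2 (first byte of 'Ω')
console.log(sequenceLength(0xf0)); // 4 (first byte of '𝄞')
```

Each test applies a mask that keeps only the marker bits and compares the result with the expected lead pattern, so the payload bits never affect the classification.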

Functions

f asCodePointsArray	Converts an iterable of Uint8Array (byte arrays) into an array of Unicode code points. This is particularly useful for processing streams of text data, where each chunk is represented as a Uint8Array, and you want to work with the text's Unicode code points.
f asCodePointsCallback	Processes an iterable or async iterable of Uint8Array chunks and invokes a callback for each code point. The function performs the following steps: (1) iterate through the input iterable, which yields chunks of bytes (Uint8Array); (2) process each chunk using a TextDecoder to extract UTF-8 characters; (3) calculate the corresponding Unicode code points for the extracted characters; (4) invoke the provided callback for each code point.
f asCodePointsIterator	Converts an iterable of UTF-8 encoded Uint8Array chunks into an async generator of Unicode code points.
f bytesToCodePoint	Converts a sequence of bytes into a Unicode code point. This function is a key part of decoding UTF-8 encoded text, as it translates the raw bytes back into the characters they represent.
f bytesToCodePointFromBuffer	Extracts a Unicode code point from a given buffer starting at a specified index. This method is useful for parsing a stream or array of data where UTF-8 characters are embedded within a larger set of binary data.
f codePointAt	Extracts the Unicode code point and its size in UTF-16 code units from a string at a given position.
f default	Converts an iterable of UTF-8 encoded Uint8Array chunks into an async generator of Unicode code points.
f getByteLength	Calculates the number of bytes required to represent a single UTF-8 character.
f getIterableFromStream	Converts a `ReadableStream` into an async iterable. This allows for easier consumption of stream data using asynchronous iteration, providing a more modern approach to handling streamed data.
f getIterableStream	Converts a `ReadableStream` into an async iterable. This allows for easier consumption of stream data using asynchronous iteration, providing a more modern approach to handling streamed data.
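The chunk-processing steps described for asCodePointsCallback can be approximated with built-ins alone. The sketch below is an independent, synchronous illustration and not the module's code: codePointsFromChunks is a hypothetical name, and the module's real signatures are not assumed. It relies on the standard TextDecoder `stream` option so that a character split across two chunks is decoded correctly.

```typescript
// Collect the Unicode code points from a sequence of UTF-8 byte chunks.
function codePointsFromChunks(chunks: Iterable<Uint8Array>): number[] {
  const decoder = new TextDecoder("utf-8");
  const codePoints: number[] = [];

  // Walk a decoded string one code point at a time. A code point above
  // U+FFFF occupies two UTF-16 code units, so the index advances by 2.
  const collect = (text: string): void => {
    for (let i = 0; i < text.length; ) {
      const codePoint = text.codePointAt(i)!;
      codePoints.push(codePoint);
      i += codePoint > 0xffff ? 2 : 1;
    }
  };

  for (const chunk of chunks) {
    // `stream: true` keeps an incomplete trailing sequence buffered
    // until the next chunk arrives.
    collect(decoder.decode(chunk, { stream: true }));
  }
  collect(decoder.decode()); // flush anything still buffered

  return codePoints;
}

// 'A', '𝄞', and 'Ω', with the 4-byte sequence split across two chunks.
const sampleChunks = [
  Uint8Array.of(0x41, 0xf0, 0x9d),
  Uint8Array.of(0x84, 0x9e, 0xce, 0xa9),
];
const sampleCodePoints = codePointsFromChunks(sampleChunks);
console.log(sampleCodePoints.map((cp) => cp.toString(16))); // [ "41", "1d11e", "3a9" ]
```

Splitting '𝄞' across the chunk boundary is deliberate: decoding each chunk in isolation would produce replacement characters, which is the situation a streaming decoder avoids.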