jstok - JavaScript and TypeScript source code tokenizer
Allows to iterate over tokens in code (code units).
Example
import {jstok, TokenType} from 'https://deno.land/x/jstok@v0.1.4/mod.ts';
import {assertEquals} from "https://deno.land/std@0.106.0/testing/asserts.ts";
const source =
` // Comment
console.log(\`Current time: \${new Date}\`);
`;
assertEquals
( [...jstok(source)].map(v => Object.assign({}, v)),
[ {nLine: 1, nColumn: 1, level: 0, type: TokenType.WHITESPACE, text: "\t"},
{nLine: 1, nColumn: 5, level: 0, type: TokenType.COMMENT, text: "// Comment"},
{nLine: 1, nColumn: 15, level: 0, type: TokenType.WHITESPACE, text: "\n\t"},
{nLine: 2, nColumn: 5, level: 0, type: TokenType.IDENT, text: "console"},
{nLine: 2, nColumn: 12, level: 0, type: TokenType.OTHER, text: "."},
{nLine: 2, nColumn: 13, level: 0, type: TokenType.IDENT, text: "log"},
{nLine: 2, nColumn: 16, level: 0, type: TokenType.OTHER, text: "("},
{nLine: 2, nColumn: 17, level: 1, type: TokenType.STRING_TEMPLATE_BEGIN, text: "`Current time: ${"},
{nLine: 2, nColumn: 34, level: 2, type: TokenType.IDENT, text: "new"},
{nLine: 2, nColumn: 37, level: 2, type: TokenType.WHITESPACE, text: " "},
{nLine: 2, nColumn: 38, level: 2, type: TokenType.IDENT, text: "Date"},
{nLine: 2, nColumn: 42, level: 1, type: TokenType.STRING_TEMPLATE_END, text: "}`"},
{nLine: 2, nColumn: 44, level: 0, type: TokenType.OTHER, text: ")"},
{nLine: 2, nColumn: 45, level: 0, type: TokenType.OTHER, text: ";"},
{nLine: 2, nColumn: 46, level: 0, type: TokenType.MORE_REQUEST, text: "\n"},
{nLine: 2, nColumn: 46, level: 0, type: TokenType.WHITESPACE, text: "\n"},
]
);
for (const token of jstok(source))
{ if (token.type != TokenType.MORE_REQUEST)
{ console.log(token);
}
}
jstok() - Tokenize string
This function returns iterator over JavaScript or TypeScript tokens found in a source code provided as a string.
function jstok(source: string, tabWidth=4, nLine=1, nColumn=1): Generator<Token, void, string|boolean|undefined>;
class Token
{ text: string;
type: TokenType;
nLine: number;
nColumn: number;
level: number;
}
const enum TokenType
{ WHITESPACE,
COMMENT,
ATTRIBUTE,
IDENT,
NUMBER,
STRING,
STRING_TEMPLATE,
STRING_TEMPLATE_BEGIN,
STRING_TEMPLATE_MID,
STRING_TEMPLATE_END,
REGEXP,
OTHER,
MORE_REQUEST,
ERROR,
}
It will start counting lines and chars from the provided nLine
and nColumn
values. When counting chars, it will respect the desired tabWidth
.
Before returning the last token in the source, it generates TokenType.MORE_REQUEST
.
You can ignore it, or you can react by calling the following it.next(more)
function of the iterator with a string argument, that contains code continuation.
In this case this code will be appended to the last token, and the tokenization process will continue.
import {jstok, TokenType} from 'https://deno.land/x/jstok@v0.1.4/mod.ts';
let source =
` // Comment
console.log(\`Current time: \${new Date}\`);
`;
function read()
{ const part = source.slice(0, 10);
source = source.slice(10);
return part;
}
const it = jstok(read());
let token;
L:while ((token = it.next().value))
{ while (token.type == TokenType.MORE_REQUEST)
{ token = it.next(read()).value;
if (!token)
{ break L;
}
}
console.log(token);
}
This library cannot be used to check source code syntax.
Though in 3 cases it returns TokenType.ERROR
: 1) if invalid character occured; 2) if unbalanced bracket occured; 3) if occured comment inside string template parameter.
If you react to TokenType.ERROR
by calling it.next(ignore)
with true
argument, the error will be ignored, and the tokenization process will continue.
import {jstok, TokenType} from 'https://deno.land/x/jstok@v0.1.4/mod.ts';
const source =
` // Comment
}
console.log(\`Current time: \${new Date}\`);
`;
const it = jstok(source);
let token;
L:while ((token = it.next().value))
{ while (token.type == TokenType.ERROR)
{ console.log('Ignored wrong token:', token);
token = it.next(true).value;
if (!token)
{ break L;
}
}
console.log(token);
}
Token
class Token
{ constructor
( public text: string,
public type: TokenType,
public nLine = 1,
public nColumn = 1,
public level = 0,
);
toString(): string;
getValue(): string;
getNumberValue(): number | bigint;
getRegExpValue(): RegExp;
}
text
- original JavaScript token text.type
- Token type.nLine
- Line number where this token starts.nColumn
- Column number on the line where this token starts.level
- Nesting level. Entering(
,[
and{
increments the level counter. Also the level is incremented when entering${
parameters in string templates.
toString()
method returns original JavaScript token (this.text
), except for TokenType.MORE_REQUEST
, for which it returns empty string.
getValue()
method converts JavaScript token to it’s JavaScript value, if it’s string.
- For
TokenType.COMMENT
- it’s the text after//
or between/*
and*/
. - For
TokenType.STRING
and allTokenType.STRING_TEMPLATE*
types - it’s the JavaScript value of the token. - For
TokenType.MORE_REQUEST
- empty string. - For others, including
TokenType.NUMBER
- it’s the original JavaScript token.
getNumberValue()
method returns Number
or BigInt
value of the token for TokenType.NUMBER
tokens. For others returns NaN
.
getRegExpValue()
method returns RegExp
object. For TokenType.REGEXP
tokens it’s the regular expression that this token represents.
For other token types this method returns just a default empty RegExp
object.
TokenType
const enum TokenType
{ WHITESPACE,
COMMENT,
ATTRIBUTE,
IDENT,
NUMBER,
STRING,
STRING_TEMPLATE,
STRING_TEMPLATE_BEGIN,
STRING_TEMPLATE_MID,
STRING_TEMPLATE_END,
REGEXP,
OTHER,
MORE_REQUEST,
ERROR,
}
WHITESPACE
- Any number of any whitespace characters. Multiple such token types are not generated in sequence.COMMENT
- One single-line or multiline comment, or hashbang.ATTRIBUTE
- Like@json
.IDENT
- Can contain unicode letters. Private property names like#flags
are also consideredIDENT
s.NUMBER
- Number.STRING
- String.STRING_TEMPLATE
- Whole backtick-string, if it has no parameters.STRING_TEMPLATE_BEGIN
- First part of a backtick-string, till it’s first parameter. The contents of parameters will be tokenized separately, and returned as corresponding token types.STRING_TEMPLATE_MID
- Part of backtick-string between two parameters.STRING_TEMPLATE_END
- Last part of backtick-string.REGEXP
- Regular expression literal.OTHER
- Other tokens, like+
,++
,?.
, etc.MORE_REQUEST
- Before returning the last token found in the source string,jstok()
generate this meta-token. If then you callit.next(more)
with a nonempty string argument, this string will be appended to the last token, and the tokenization will continue.ERROR
- This token type is returned in 3 situations: 1) invalid character occured; 2) unbalanced bracket occured; 3) occured comment inside string template parameter. If then you callit.next(ignore)
withtrue
argument, this error condition will be ignored, and the tokenization will continue.
jstokReader() - Tokenize Deno.Reader
This function allows to tokenize a Deno.Reader
stream of JavaScript or TypeScript source code.
It never generates TokenType.MORE_REQUEST
.
async function *jstokReader(source: Deno.Reader, tabWidth=4, nLine=1, nColumn=1, decoder?: TextDecoder): AsyncGenerator<Token, void>;
It will start counting lines and chars from the provided nLine
and nColumn
values. When counting chars, it will respect the desired tabWidth
.
If decoder
is provided, will use it to convert bytes to text. This function only supports “utf-8”, “utf-16le”, “utf-16be” and all 1-byte encodings (not “big5”, etc.).
import {jstokReader} from 'https://deno.land/x/jstok@v0.1.4/mod.ts';
const fh = await Deno.open(new URL(import.meta.url).pathname, {read: true});
try
{ for await (const token of jstokReader(fh))
{ console.log(token);
}
}
finally
{ fh.close();
}