A token rule: a kind, the pattern that recognizes it, whether matches are discarded (whitespace, comments), and — for contextual lexing — the modes it is active in and an optional mode transition it fires when it wins. More...

#include <lexer.hpp>

Public Attributes
int	kind
	Kind assigned to tokens this rule produces.

real::regex	pattern
	The recognizer (a linear-time REAL regex; its flags are the author's — see above).

bool	skip {false}
	If true, matches are consumed but not emitted.

std::vector< std::string >	in_mode {}
	Modes this rule is active in; empty ⇒ {"default"}.

std::optional< mode_action >	action {}
	Mode transition fired when this rule wins.

Detailed Description

A token rule: a kind, the pattern that recognizes it, whether matches are discarded (whitespace, comments), and — for contextual lexing — the modes it is active in and an optional mode transition it fires when it wins.

in_mode empty means the rule is active in the implicit "default" mode only, so a plain {kind, pattern, skip} rule keeps working unchanged.

The pattern is a fully-formed real::regex, so the grammar author owns its flags.

Unicode identifiers vs DFA speed — the grammar author's choice

This is a real trade-off, not a footnote. \w+ (or [^\W\d]\w*) with the default flags reads Unicode identifiers — café, 変数 — the faithful behaviour for a language like Python 3. But a Unicode \w \d \s \b compiles to a match-time code-point predicate, which no DFA can represent, so a mode that requests DFA acceleration (dfa_modes) and contains one is transparently demoted to the general Pike engine (same tokens; the demotion is visible via lexer::dfa_modes_active). Concretely: the general engine lexes at roughly 8–13 MB/s, while a DFA-accelerated mode runs about 20× that — so the Unicode identifier costs the DFA fast path.

If your identifiers are ASCII by specification (JSON, SQL, C), pin (?a) inline in the pattern (or pass real::flags::ascii) to keep \w \d \s \b ASCII and small, DFA-representable, and fast — this is what the examples/ grammars do. If you want Unicode identifiers, write \w+ and accept the general-engine floor. The two spellings tokenize the same ASCII input identically; they differ only on non-ASCII input and on whether the mode can be a DFA.

Definition at line 109 of file lexer.hpp.

Member Data Documentation

◆ action

std::optional<mode_action> scilex::rule::action {}

Mode transition fired when this rule wins.

Definition at line 115 of file lexer.hpp.

◆ in_mode

std::vector<std::string> scilex::rule::in_mode {}

Modes this rule is active in; empty ⇒ {"default"}.

Definition at line 114 of file lexer.hpp.

◆ kind

int scilex::rule::kind

Kind assigned to tokens this rule produces.

Definition at line 111 of file lexer.hpp.

◆ pattern

real::regex scilex::rule::pattern

The recognizer (a linear-time REAL regex; its flags are the author's — see above).

Definition at line 112 of file lexer.hpp.

◆ skip

bool scilex::rule::skip {false}

If true, matches are consumed but not emitted.

Definition at line 113 of file lexer.hpp.

The documentation for this struct was generated from the following file:

include/scilex/lexer.hpp

Public Attributes