SciLex
A header-only C++20 lexer built on REAL
scilex::rule Struct Reference

A token rule: a kind, the pattern that recognizes it, whether matches are discarded (whitespace, comments), and, for contextual lexing, the modes it is active in and an optional mode transition it fires when it wins.

#include <lexer.hpp>

Public Attributes

int kind
 Kind assigned to tokens this rule produces.
 
real::regex pattern
 The recognizer (a linear-time REAL regex; its flags are the author's, see the Detailed Description).
 
bool skip {false}
 If true, matches are consumed but not emitted.
 
std::vector< std::string > in_mode {}
 Modes this rule is active in; empty ⇒ {"default"}.
 
std::optional< mode_action > action {}
 Mode transition fired when this rule wins.
 

Detailed Description

A token rule: a kind, the pattern that recognizes it, whether matches are discarded (whitespace, comments), and — for contextual lexing — the modes it is active in and an optional mode transition it fires when it wins.

in_mode empty means the rule is active in the implicit "default" mode only, so a plain {kind, pattern, skip} rule keeps working unchanged.
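
A minimal sketch of why the short form keeps working, written against a stand-in struct rather than the real header. sketch_rule::rule mirrors the five members listed above, with std::string substituted for real::regex and a hypothetical one-field mode_action (its actual definition is not shown on this page). The helper effective_modes is likewise only an illustration of the documented "empty ⇒ {"default"}" convention, not a SciLex function.

```cpp
#include <optional>
#include <string>
#include <vector>

namespace sketch_rule {

// Hypothetical placeholder: the real scilex::mode_action is not shown here.
struct mode_action {
    std::string target;
};

// Stand-in mirroring the documented members of scilex::rule.
// std::string replaces real::regex so the sketch compiles on its own.
struct rule {
    int kind;
    std::string pattern;
    bool skip{false};
    std::vector<std::string> in_mode{};
    std::optional<mode_action> action{};
};

// Illustrative helper: resolve the documented "empty => {"default"}" convention.
inline std::vector<std::string> effective_modes(const rule& r) {
    if (r.in_mode.empty()) return {"default"};
    return r.in_mode;
}

// A plain three-field rule: in_mode and action fall back to their defaults.
inline rule whitespace_rule() {
    return rule{0, "\\s+", true};
}

// A contextual rule that names its mode explicitly.
inline rule string_body_rule() {
    return rule{1, "[^\"]+", false, {"string"}};
}

}  // namespace sketch_rule
```

Because the trailing members carry default initializers, aggregate initialization with only the first three fields leaves in_mode empty and action disengaged, which is exactly the state the sentence above describes as "default mode only".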

The pattern is a fully-formed real::regex, so the grammar author owns its flags.

Unicode identifiers vs DFA speed — the grammar author's choice

This is a real trade-off, not a footnote. \w+ (or [^\W\d]\w*) with the default flags reads Unicode identifiers such as café and 変数, which is the faithful behaviour for a language like Python 3. But a Unicode-aware \w, \d, \s or \b compiles to a match-time code-point predicate, which no DFA can represent, so a mode that requests DFA acceleration (dfa_modes) and contains one is transparently demoted to the general Pike engine. The tokens are the same, and the demotion is visible via lexer::dfa_modes_active. Concretely: the general engine lexes at roughly 8–13 MB/s, while a DFA-accelerated mode runs about 20× that, so choosing Unicode identifiers costs that mode the DFA fast path.

If your identifiers are ASCII by specification (JSON, SQL, C), pin (?a) inline in the pattern (or pass real::flags::ascii) to keep \w \d \s \b ASCII and small, DFA-representable, and fast — this is what the examples/ grammars do. If you want Unicode identifiers, write \w+ and accept the general-engine floor. The two spellings tokenize the same ASCII input identically; they differ only on non-ASCII input and on whether the mode can be a DFA.
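
The last claim can be illustrated without REAL at all. The sketch below hand-codes two identifier scanners over UTF-8 bytes: ascii_ident_len accepts only [A-Za-z_][A-Za-z0-9_]*, mirroring what pinning \w to ASCII does, while permissive_ident_len additionally treats every non-ASCII byte as a word character. That second rule is a deliberately crude stand-in for a Unicode \w (a real Unicode class decodes code points and checks their properties). Both function names are hypothetical and neither is part of SciLex or REAL.

```cpp
#include <cstddef>
#include <string_view>

namespace sketch_ident {

// ASCII-only word class: letters, digits, underscore.
inline bool ascii_word(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

inline bool ascii_digit(unsigned char c) { return c >= '0' && c <= '9'; }

// Crude stand-in for a Unicode word class: ASCII word bytes plus any
// byte >= 0x80 (that is, any byte of a multi-byte UTF-8 sequence).
inline bool permissive_word(unsigned char c) {
    return ascii_word(c) || c >= 0x80;
}

// Byte length of the identifier at the start of s under the given word class.
// An identifier may not start with a digit, mirroring [^\W\d]\w*.
template <typename WordPred>
std::size_t ident_len(std::string_view s, WordPred is_word) {
    if (s.empty()) return 0;
    const auto first = static_cast<unsigned char>(s[0]);
    if (!is_word(first) || ascii_digit(first)) return 0;
    std::size_t n = 1;
    while (n < s.size() && is_word(static_cast<unsigned char>(s[n]))) ++n;
    return n;
}

inline std::size_t ascii_ident_len(std::string_view s) {
    return ident_len(s, ascii_word);
}

inline std::size_t permissive_ident_len(std::string_view s) {
    return ident_len(s, permissive_word);
}

}  // namespace sketch_ident
```

On the ASCII input "total = 1" both scanners return 5. On "café" encoded as UTF-8 (5 bytes) the ASCII scanner stops after "caf" and returns 3, while the permissive one consumes all 5 bytes. The ASCII predicate is a fixed table over 256 byte values, which is the kind of thing a DFA can encode directly.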

Definition at line 109 of file lexer.hpp.

Member Data Documentation

◆ action

std::optional<mode_action> scilex::rule::action {}

Mode transition fired when this rule wins.

Definition at line 115 of file lexer.hpp.

◆ in_mode

std::vector<std::string> scilex::rule::in_mode {}

Modes this rule is active in; empty ⇒ {"default"}.

Definition at line 114 of file lexer.hpp.

◆ kind

int scilex::rule::kind

Kind assigned to tokens this rule produces.

Definition at line 111 of file lexer.hpp.

◆ pattern

real::regex scilex::rule::pattern

The recognizer (a linear-time REAL regex; its flags are the author's — see above).

Definition at line 112 of file lexer.hpp.

◆ skip

bool scilex::rule::skip {false}

If true, matches are consumed but not emitted.

Definition at line 113 of file lexer.hpp.
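
How "consumed but not emitted" plays out in a tokenizing loop can be sketched with std::regex standing in for real::regex. Everything here is an assumption made for illustration: the token struct, the tokenize and demo functions, the single implicit mode, and the longest-match policy with ties going to the earlier rule are not taken from lexer.hpp, and std::regex carries no linear-time guarantee.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

namespace sketch_skip {

// Reduced stand-in for scilex::rule: only the members this sketch needs.
struct rule {
    int kind;
    std::regex pattern;  // std::regex substitutes for real::regex
    bool skip{false};
};

// Hypothetical token type.
struct token {
    int kind;
    std::string text;
};

inline std::vector<token> tokenize(const std::string& input,
                                   const std::vector<rule>& rules) {
    std::vector<token> out;
    auto pos = input.cbegin();
    while (pos != input.cend()) {
        const rule* winner = nullptr;
        std::ptrdiff_t best = 0;
        for (const rule& r : rules) {
            std::smatch m;
            // match_continuous anchors the match at pos.
            if (std::regex_search(pos, input.cend(), m, r.pattern,
                                  std::regex_constants::match_continuous) &&
                m.length(0) > best) {
                best = m.length(0);
                winner = &r;
            }
        }
        if (winner == nullptr) throw std::runtime_error("no rule matches");
        // A skip rule still wins and still consumes; it just emits nothing.
        if (!winner->skip) {
            out.push_back(token{winner->kind, std::string(pos, pos + best)});
        }
        pos += best;
    }
    return out;
}

// Whitespace is skipped; identifiers (kind 1) and numbers (kind 2) are emitted.
inline std::vector<token> demo() {
    const std::vector<rule> rules = {
        {0, std::regex("[ \\t\\n]+"), true},
        {1, std::regex("[A-Za-z_][A-Za-z0-9_]*")},
        {2, std::regex("[0-9]+")},
    };
    return tokenize("x1  42", rules);
}

}  // namespace sketch_skip
```

Calling sketch_skip::demo() yields two tokens, the identifier "x1" and the number "42". The two spaces between them are matched by the whitespace rule and advance the cursor, but never reach the output vector.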


The documentation for this struct was generated from the following file: