SciLex
A header-only C++20 lexer built on REAL
A token rule: a kind, the pattern that recognizes it, whether matches are discarded (whitespace, comments), and, for contextual lexing, the modes it is active in and an optional mode transition it fires when it wins.
#include <lexer.hpp>
Public Attributes
| Type | Name | Description |
| --- | --- | --- |
| int | kind | Kind assigned to tokens this rule produces. |
| real::regex | pattern | The recognizer (a linear-time REAL regex; its flags are the author's, see the detailed description). |
| bool | skip {false} | If true, matches are consumed but not emitted. |
| std::vector< std::string > | in_mode {} | Modes this rule is active in; empty ⇒ {"default"}. |
| std::optional< mode_action > | action {} | Mode transition fired when this rule wins. |
in_mode empty means the rule is active in the implicit "default" mode only, so a plain {kind, pattern, skip} rule keeps working unchanged.
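The defaulting rule for in_mode can be sketched without the library. The block below is only an illustration under stated assumptions: std::regex stands in for real::regex, mode_action is an empty placeholder (its real contents are not shown in this section), and effective_modes is a hypothetical helper, not a SciLex function.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Stand-ins so the sketch compiles without SciLex or REAL installed.
// ASSUMPTION: std::regex substitutes for real::regex, and mode_action is an
// empty placeholder for the library's real type.
struct mode_action {};

// Mirrors the five documented members and their default initializers.
struct rule {
    int kind;
    std::regex pattern;
    bool skip{false};
    std::vector<std::string> in_mode{};
    std::optional<mode_action> action{};
};

// Hypothetical helper: the modes a rule is active in, applying the
// documented convention that an empty in_mode means {"default"}.
std::vector<std::string> effective_modes(const rule& r) {
    if (r.in_mode.empty()) return {"default"};
    return r.in_mode;
}

// Usage: a plain three-field rule next to a mode-restricted one.
inline void rule_defaults_example() {
    rule ws{0, std::regex(R"(\s+)"), true};                   // skipped, default mode
    rule text{1, std::regex(R"([^"]+)"), false, {"string"}};  // active only in "string"
    assert(effective_modes(ws) == std::vector<std::string>{"default"});
    assert(effective_modes(text) == std::vector<std::string>{"string"});
    assert(!ws.action.has_value());  // no mode transition unless one is given
}
```

Because the trailing members carry default initializers, the three-field spelling leaves in_mode empty and action unset, which is why older rule lists need no changes.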
The pattern is a fully-formed real::regex, so the grammar author owns its flags.
This is a real trade-off, not a footnote. With the default flags, \w+ (or [^\W\d]\w*) reads Unicode identifiers such as café and 変数, which is the faithful behaviour for a language like Python 3. But a Unicode \w, \d, \s, or \b compiles to a match-time code-point predicate, which no DFA can represent. A mode that requests DFA acceleration (dfa_modes) and contains one is therefore transparently demoted to the general Pike engine: the tokens are the same, and the demotion is visible via lexer::dfa_modes_active. Concretely, the general engine lexes at roughly 8–13 MB/s, while a DFA-accelerated mode runs about 20× that, so Unicode identifiers cost the mode its DFA fast path.
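To make those figures concrete, here is a small worked calculation. It uses only the numbers quoted above (8–13 MB/s and a factor of about 20); the 100 MB input size and every name in the block are arbitrary choices for illustration, not measurements or library identifiers.

```cpp
// Throughput figures quoted in the text; both are approximate.
constexpr double general_low_mbps = 8.0;
constexpr double general_high_mbps = 13.0;
constexpr double dfa_speedup = 20.0;

// Implied DFA range: about 160 to 260 MB/s.
constexpr double dfa_low_mbps = general_low_mbps * dfa_speedup;
constexpr double dfa_high_mbps = general_high_mbps * dfa_speedup;

// Seconds needed to lex `megabytes` of input at a given throughput.
constexpr double seconds_to_lex(double megabytes, double mbps) {
    return megabytes / mbps;
}

// ASSUMPTION: a 100 MB input, chosen only to make the gap tangible.
constexpr double input_mb = 100.0;
constexpr double general_worst_s = seconds_to_lex(input_mb, general_low_mbps);  // 12.5 s
constexpr double dfa_worst_s = seconds_to_lex(input_mb, dfa_low_mbps);          // 0.625 s

static_assert(dfa_low_mbps == 160.0 && dfa_high_mbps == 260.0);
static_assert(general_worst_s == 12.5 && dfa_worst_s == 0.625);
```

On these assumptions a demoted mode spends seconds where a DFA mode spends a fraction of a second, which is the cost being weighed against Unicode identifiers.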
If your identifiers are ASCII by specification (JSON, SQL, C), pin (?a) inline in the pattern (or pass real::flags::ascii) so that \w, \d, \s, and \b stay ASCII: small, DFA-representable, and fast. This is what the examples/ grammars do. If you want Unicode identifiers, write \w+ and accept the general-engine floor. The two spellings tokenize ASCII input identically; they differ only on non-ASCII input and in whether the mode can be a DFA.
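The difference between the two spellings can be illustrated with hand-written predicates instead of the regex engine. This is a sketch under loud assumptions: unicode_word covers only Latin-1 letters and one CJK block as a tiny stand-in for real Unicode property tables, and none of the names below come from SciLex or REAL.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// ASCII word character: what \w is restricted to under (?a).
constexpr bool ascii_word(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
           (c >= U'0' && c <= U'9') || c == U'_';
}

// ASSUMPTION: a deliberately tiny approximation of Unicode \w. A real engine
// consults full Unicode property tables; this knows two ranges only.
constexpr bool unicode_word(char32_t c) {
    if (ascii_word(c)) return true;
    const bool latin1_letter =
        c >= 0x00C0 && c <= 0x00FF && c != 0x00D7 && c != 0x00F7;
    const bool cjk_ideograph = c >= 0x4E00 && c <= 0x9FFF;
    return latin1_letter || cjk_ideograph;
}

// Length of the longest run of word characters at the start of `s`,
// i.e. what a \w+ rule would consume there.
template <class Pred>
std::size_t word_prefix(const std::u32string& s, Pred is_word) {
    std::size_t n = 0;
    while (n < s.size() && is_word(s[n])) ++n;
    return n;
}

// Usage: identical on ASCII input, different on non-ASCII input.
inline void ascii_vs_unicode_example() {
    const std::u32string cafe_ascii = U"cafe";
    const std::u32string cafe_accent = U"caf\u00E9";    // café
    const std::u32string hensuu = U"\u5909\u6570";      // 変数
    assert(word_prefix(cafe_ascii, ascii_word) == 4);
    assert(word_prefix(cafe_ascii, unicode_word) == 4);
    assert(word_prefix(cafe_accent, ascii_word) == 3);  // stops before é
    assert(word_prefix(cafe_accent, unicode_word) == 4);
    assert(word_prefix(hensuu, ascii_word) == 0);       // no match at all
    assert(word_prefix(hensuu, unicode_word) == 2);
}
```

The ASCII predicate is a fixed set of byte ranges, which is the kind of test a DFA can encode directly; the Unicode one stands for the match-time code-point check that forces the general engine.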
std::optional<mode_action> scilex::rule::action {}
std::vector<std::string> scilex::rule::in_mode {}
int scilex::rule::kind
real::regex scilex::rule::pattern
bool scilex::rule::skip {false}