|
SciLex
A header-only C++20 lexer built on REAL
|
A lexer built from an ordered list of rules. More...
#include <lexer.hpp>
Classes | |
| struct | dispatch |
| Per-mode dispatch index: the first-byte buckets scoped to one mode. More... | |
| struct | mode_dfa |
| An adopted per-mode DFA: the automaton plus its local→global rule map. More... | |
| struct | munch_result |
| A munch decision: whether a rule matched, which (global index), how many bytes — the small value scan_next's Pike branch and the audit share. More... | |
Public Member Functions | |
| lexer (std::vector< rule > rules, std::unordered_set< std::string > insignificant_modes={}, std::unordered_set< std::string > dfa_modes={}, error_policy errors=error_policy::raise, column_unit columns=column_unit::bytes) | |
Builds a lexer from rules (taken by value, then moved in). | |
| column_unit | columns () const noexcept |
The unit this lexer counts position::column in (positions do not carry it, so a consumer that needs to interpret a column reads the unit here). | |
| std::vector< token > | tokenize (std::string_view source, eof_policy policy=eof_policy::omit) const |
Tokenizes source into the sequence of non-skipped tokens. | |
| token_range | scan (std::string_view source, eof_policy policy=eof_policy::omit) const & |
Returns a lazy range over the non-skipped tokens of source. | |
| token_range | scan (std::string_view source, eof_policy policy=eof_policy::omit) const &&=delete |
| Deleted: the range would point into a temporary lexer. | |
| const std::vector< bool > & | mode_significant () const noexcept |
The per-mode-id layout-significance policy (see scilex::layout). Index by a token's mode_id; false marks an insignificant mode. Empty unless the lexer was built with insignificant_modes. | |
| const std::string & | mode_name (std::size_t id) const noexcept |
The name of mode id (0 is "default"), for labelling tokens. | |
| std::vector< std::string > | dfa_modes_active () const |
| The modes actually accelerated by a DFA fast path. | |
Private Member Functions | |
| bool | scan_next (std::string_view source, position &cursor, std::vector< frame > &stack, token &out) const |
Advances cursor to and past the next non-skipped token in the active mode, applying the winning rule's mode transition (if any). | |
| std::size_t | intern_mode (const std::string &name) |
| Interns a mode name to its id, assigning the next id on first sight. | |
| void | add_to_mode (std::size_t m, std::size_t idx) |
Adds rule idx to mode m's dispatch via REAL's exact first-byte API — the same 3-way split (nullable → general; one fixed byte → its bucket; otherwise → every bucket the set admits) as the mono-mode build. | |
| bool | mode_is_empty (std::size_t m) const |
Whether mode m has no active rule (so nothing can match in it). | |
| void | validate_transitions () const |
| Fail-fast transition checks: a transition rule must consume input, and a push/set target must be a defined, non-empty mode. | |
| void | build_dispatch () |
| Builds the per-mode first-byte dispatch from rules_ (once, at construction). "default" is mode 0; every name in a rule's rule::in_mode and every push/set target is interned to an id. For each mode, the rules active in it are bucketed by REAL's exact first-byte API, keeping rules_ order so the index tie-break in scan_next is the rule priority. Finishes with validate_transitions. | |
| void | build_significance (const std::unordered_set< std::string > &insignificant_modes) |
| Builds the layout-significance policy from the insignificant-mode names (validated against the interned modes). With none, the policy stays empty — every mode significant, so scilex::layout is the positional pass (invariant 1). | |
| munch_result | munch_at (std::size_t mode, std::string_view rest, unsigned char lead) const |
The winning munch in mode at the start of rest (lead is rest's first byte), dispatching to the mode's DFA when it has one, else the Pike + first-byte munch. The single match primitive both the forward scan and the error-recovery probe call, so the two never diverge on which rule wins. | |
| bool | may_start (std::size_t mode, unsigned char byte) const |
O(1) pre-filter for error recovery: can a fixed-lead rule in mode begin with byte? A false is conclusive (no rule can match, so the byte is error text); a true still needs a full munch_at to confirm a real match. This is what skips the bulk of noise cheaply. | |
| bool | starts_a_match (std::size_t mode, std::string_view source, std::size_t offset) const |
Does a rule in mode match at offset in source? The error-recovery loop's stop test — the smallest such offset ends an error run. Since recovery never runs in a mode with a nullable rule (see may_start), any match here has positive length, so have alone is the stop condition (a zero-length win is impossible in this context). | |
| void | advance (std::string_view source, position &cursor, std::size_t n) const |
Advances cursor by n bytes of source, maintaining the 1-based line/column tracker (a newline resets the column). The shared advance point for a matched token, a recovery step, and the error-run scan. | |
| munch_result | pike_munch_in_mode (std::size_t mode, std::string_view rest, unsigned char lead) const |
The per-rule Pike + first-byte-dispatch munch in mode at the start of rest (lead is rest's first byte). Zero allocation; shared by scan_next's Pike branch and the DFA audit. A zero-length match is reported as a candidate (it wins only when nothing matches >0); scan_next's shared guard turns such a win into a lexical error. | |
| bool | rule_active_in_mode (std::size_t idx, std::size_t mode) const |
Whether rule idx is active in mode mode (mirrors build_dispatch, an empty in_mode is the default mode only; otherwise the listed modes). | |
| std::vector< std::string > | audit_probes (const std::vector< std::size_t > &to_global) const |
| The bounded, deterministic probe inputs for the audit: every active rule's possible first bytes ∪ structural bytes, as singletons and short repeats (repeats expose lazy delimiters and quantifier boundaries — the hard cases), then fixed-seed random strings. At most 512 inputs, each ≤ 48 bytes. | |
| bool | audit_passes (const real::dfa &candidate, const std::vector< std::size_t > &to_global, std::size_t mode) const |
| The candidate DFA must reproduce the Pike munch on every probe: catches divergences the bytecode cannot reveal — chiefly a lazy quantifier, whose match() is the shortest span while the DFA takes the longest. | |
| std::optional< mode_dfa > | try_build_mode_dfa (std::vector< std::size_t > to_global, std::size_t mode) |
Builds the mode_dfa for one mode, or std::nullopt if the mode cannot take the DFA fast path. Two non-error reasons return nullopt: the rules contain an un-DFA-able assertion (real::dfa throws real::dfa_error — caught here, turned into nullopt), or the DFA fails the build-time equivalence audit (a lazy quantifier). Both leave the mode on Pike. Returning the outcome instead of throwing past the caller keeps the fast-path decision explicit at the call site. | |
| void | build_dfa_modes (const std::unordered_set< std::string > &dfa_modes) |
Opts the named dfa_modes into the DFA fast path (called once, after build_dispatch). For each, builds a real::dfa from the mode's active rules in ascending global index (= priority); an un-DFA-able assertion (real::dfa_error) or a failed audit_passes leaves the mode on Pike (nullptr). Best-effort — see dfa_modes_active. | |
Static Private Member Functions | |
| static std::string | position_label (position where) |
| Formats a position as "line:column" for diagnostics. | |
| static std::size_t | valid_utf8_len (std::string_view s, std::size_t off) |
The length (1–4) of a valid UTF-8 codepoint starting at off in s, or 0 when the byte there is not a valid lead — a continuation byte, a truncated/overlong/surrogate/ out-of-range sequence, or an invalid lead. The column stepper's UTF-8 validator. | |
| static std::size_t | column_step (std::string_view source, std::size_t off, scilex::column_unit unit) |
How much the column advances when the byte at off in source is consumed, under unit. bytes is always 1 (so the byte mode is the historical column == byte offset). codepoints counts one per valid codepoint (its lead scores 1, its continuations 0); utf16 scores 2 for an astral (4-byte) codepoint, 1 otherwise. A malformed byte — including an orphan continuation — scores 1 in every unit, so the column stays defined across the error runs recovery emits. | |
Private Attributes | |
| std::vector< rule > | rules_ |
| The ordered token rules. | |
| error_policy | errors_ |
| What to do at an unmatched byte. | |
| scilex::column_unit | columns_ |
| The unit position::column is counted in. | |
| std::vector< std::string > | mode_names_ |
| Mode id -> name ("default" is id 0). | |
| std::map< std::string, std::size_t > | mode_id_ |
| Mode name -> id. | |
| std::vector< dispatch > | per_mode_ |
| Dispatch index, one per mode (by id). | |
| std::vector< std::shared_ptr< const mode_dfa > > | per_mode_dfa_ |
| Per-mode DFA fast path (nullptr = Pike). | |
| std::vector< bool > | mode_significant_ |
| Layout policy (empty = all significant). | |
Friends | |
| class | token_iterator |
A lexer built from an ordered list of rules.
Order matters only as a tie-breaker between rules whose matches have equal length (the first such rule wins). Put more specific rules (keywords) before their general counterparts (identifiers).
|
inlineexplicit |
Builds a lexer from rules (taken by value, then moved in).
| [in] | rules | The ordered token rules. |
| [in] | insignificant_modes | Modes whose tokens carry no layout structure (Layout Awareness Level A — see scilex::layout). Each name must be a mode the rules use; empty (the default) leaves every mode significant, so mode_significant has no effect. |
| [in] | dfa_modes | Modes to accelerate with a real::dfa fast path (one DFA pass replaces the per-rule Pike dispatch). Each name must be a mode the rules use. Opt-in is best-effort: a mode whose rules cannot be a DFA (a zero-width assertion) or whose DFA fails the build-time audit (a lazy quantifier) silently stays on Pike — see dfa_modes_active. The token stream is identical either way. |
| [in] | errors | What to do at a byte no rule can lex: error_policy::raise (the default — throw) or error_policy::token (recover, emitting an scilex::error token). The recovery path never throws per byte; the token stream under raise is unchanged. |
| [in] | columns | The unit each token's position::column is counted in: column_unit::bytes (the default, unchanged), column_unit::codepoints, or column_unit::utf16. The unit is not stored on the position — read it back with columns(). |
| std::invalid_argument | If a transition rule is malformed (empty pattern or target), or insignificant_modes / dfa_modes names an unknown mode. |
|
inlineprivate |
|
inlineprivate |
|
inlineprivate |
|
inlineprivate |
The bounded, deterministic probe inputs for the audit: every active rule's possible first bytes ∪ structural bytes, as singletons and short repeats (repeats expose lazy delimiters and quantifier boundaries — the hard cases), then fixed-seed random strings. At most 512 inputs, each ≤ 48 bytes.
|
inlineprivate |
Opts the named dfa_modes into the DFA fast path (called once, after build_dispatch). For each, builds a real::dfa from the mode's active rules in ascending global index (= priority); an un-DFA-able assertion (real::dfa_error) or a failed audit_passes leaves the mode on Pike (nullptr). Best-effort — see dfa_modes_active.
| [in] | dfa_modes | The opted-in mode names. The build-time equivalence audit always runs; its outcome is observable via dfa_modes_active. |
| std::invalid_argument | If dfa_modes names an unknown mode. |
|
inlineprivate |
Builds the per-mode first-byte dispatch from rules_ (once, at construction). "default" is mode 0; every name in a rule's rule::in_mode and every push/set target is interned to an id. For each mode, the rules active in it are bucketed by REAL's exact first-byte API, keeping rules_ order so the index tie-break in scan_next is the rule priority. Finishes with validate_transitions.
|
inlineprivate |
Builds the layout-significance policy from the insignificant-mode names (validated against the interned modes). With none, the policy stays empty — every mode significant, so scilex::layout is the positional pass (invariant 1).
|
inlinestaticprivate |
How much the column advances when the byte at off in source is consumed, under unit. bytes is always 1 (so the byte mode is the historical column == byte offset). codepoints counts one per valid codepoint (its lead scores 1, its continuations 0); utf16 scores 2 for an astral (4-byte) codepoint, 1 otherwise. A malformed byte — including an orphan continuation — scores 1 in every unit, so the column stays defined across the error runs recovery emits.
|
inlinenoexcept |
The unit this lexer counts position::column in (positions do not carry it, so a consumer that needs to interpret a column reads the unit here).
|
inline |
The modes actually accelerated by a DFA fast path.
A mode passed in dfa_modes but rejected — its rules need an assertion no DFA can represent (real::dfa_error), or its DFA failed the build-time audit (a lazy quantifier) — is absent here: it fell back to the Pike path, lexing the same tokens. So dfa_modes is an optimizer, not a guarantee; the rejected set is dfa_modes − dfa_modes_active().
|
inlineprivate |
|
inlineprivate |
O(1) pre-filter for error recovery: can a fixed-lead rule in mode begin with byte? A false is conclusive (no rule can match, so the byte is error text); a true still needs a full munch_at to confirm a real match. This is what skips the bulk of noise cheaply.
Only the first-byte buckets are consulted, not the mode's general (nullable) rules: a nullable rule matches the empty string at every position, so a mode that had one could never reach the no-match case (#1) that starts a recovery run — it would report a zero-length error (#4) at the very first byte instead. So during recovery the active mode provably has no general rule.
|
inlineprivate |
|
inlinenoexcept |
|
inlinenoexcept |
The per-mode-id layout-significance policy (see scilex::layout). Index by a token's mode_id; false marks an insignificant mode. Empty unless the lexer was built with insignificant_modes.
|
inlineprivate |
The winning munch in mode at the start of rest (lead is rest's first byte), dispatching to the mode's DFA when it has one, else the Pike + first-byte munch. The single match primitive both the forward scan and the error-recovery probe call, so the two never diverge on which rule wins.
|
inlineprivate |
The per-rule Pike + first-byte-dispatch munch in mode at the start of rest (lead is rest's first byte). Zero allocation; shared by scan_next's Pike branch and the DFA audit. A zero-length match is reported as a candidate (it wins only when nothing matches >0); scan_next's shared guard turns such a win into a lexical error.
|
inlinestaticprivate |
|
inlineprivate |
Whether rule idx is active in mode mode (mirrors build_dispatch, an empty in_mode is the default mode only; otherwise the listed modes).
|
inline |
Returns a lazy range over the non-skipped tokens of source.
Each ++ produces the next token on demand; nothing but the current token is held. Usable in a range-for. Errors surface as lex_error thrown while advancing.
| [in] | source | The text to scan (must outlive the iteration; each token's lexeme views into it). |
| [in] | policy | Whether to yield a terminal end_of_input token. |
| lex_error | (while iterating) if some position matches no rule. |
|
delete |
Deleted: the range would point into a temporary lexer.
|
inlineprivate |
Advances cursor to and past the next non-skipped token in the active mode, applying the winning rule's mode transition (if any).
Skips over skip-rule matches, then fills out with the next emitted token, advancing cursor (offset, line, byte column) and the mode stack past it. The single source of scanning truth shared by tokenize and the lazy token_iterator.
| [in] | source | The text being scanned. |
| [in,out] | cursor | Current position; advanced past the consumed bytes. |
| [in,out] | stack | The mode stack (its top is the active mode); a winning rule's transition mutates it. Never empty. |
| [out] | out | Receives the next non-skipped token on success. |
true if a token was produced, false at end of input. | lex_error | If a position matches no rule in the active mode (#1), a rule pops at the stack root (#2), input ends inside a pushed mode (#3), or the winning match is zero-length and so cannot advance the scan (#4). |
|
inlineprivate |
Does a rule in mode match at offset in source? The error-recovery loop's stop test — the smallest such offset ends an error run. Since recovery never runs in a mode with a nullable rule (see may_start), any match here has positive length, so have alone is the stop condition (a zero-length win is impossible in this context).
|
inline |
Tokenizes source into the sequence of non-skipped tokens.
At each position every rule is matched anchored; the longest match wins (ties broken by rule order). A zero-length winning match (a nullable rule with no longer match here) cannot advance the scan, so it is reported as a lex_error rather than allowed to stall.
| [in] | source | The text to tokenize (must outlive the returned tokens; each token's lexeme views into it). |
| [in] | policy | Whether to append a terminal end_of_input token. |
| lex_error | If some position is matched by no rule, or only by a zero-length match. |
|
inlineprivate |
Builds the mode_dfa for one mode, or std::nullopt if the mode cannot take the DFA fast path. Two non-error reasons return nullopt: the rules contain an un-DFA-able assertion (real::dfa throws real::dfa_error — caught here, turned into nullopt), or the DFA fails the build-time equivalence audit (a lazy quantifier). Both leave the mode on Pike. Returning the outcome instead of throwing past the caller keeps the fast-path decision explicit at the call site.
| [in] | to_global | The mode's active rules, ascending global index (priority). |
| [in] | mode | The mode id (for the audit). |
|
inlinestaticprivate |
|
inlineprivate |
|
friend |
|
private |
The unit position::column is counted in.
|
private |
|
private |
|
private |
|
private |
|
private |
|
private |
|
private |