SciLex
A header-only C++20 lexer built on REAL
Loading...
Searching...
No Matches
Classes | Public Member Functions | Private Member Functions | Static Private Member Functions | Private Attributes | Friends | List of all members
scilex::lexer Class Reference

A lexer built from an ordered list of rules. More...

#include <lexer.hpp>

Classes

struct  dispatch
 Per-mode dispatch index: the first-byte buckets scoped to one mode. More...
 
struct  mode_dfa
 An adopted per-mode DFA: the automaton plus its local→global rule map. More...
 
struct  munch_result
 A munch decision: whether a rule matched, which (global index), how many bytes — the small value scan_next's Pike branch and the audit share. More...
 

Public Member Functions

 lexer (std::vector< rule > rules, std::unordered_set< std::string > insignificant_modes={}, std::unordered_set< std::string > dfa_modes={}, error_policy errors=error_policy::raise, column_unit columns=column_unit::bytes)
 Builds a lexer from rules (taken by value, then moved in).
 
column_unit columns () const noexcept
 The unit this lexer counts position::column in (positions do not carry it, so a consumer that needs to interpret a column reads the unit here).
 
std::vector< tokentokenize (std::string_view source, eof_policy policy=eof_policy::omit) const
 Tokenizes source into the sequence of non-skipped tokens.
 
token_range scan (std::string_view source, eof_policy policy=eof_policy::omit) const &
 Returns a lazy range over the non-skipped tokens of source.
 
token_range scan (std::string_view source, eof_policy policy=eof_policy::omit) const &&=delete
 Deleted: the range would point into a temporary lexer.
 
const std::vector< bool > & mode_significant () const noexcept
 The per-mode-id layout-significance policy (see scilex::layout). Index by a token's mode_id; false marks an insignificant mode. Empty unless the lexer was built with insignificant_modes.
 
const std::string & mode_name (std::size_t id) const noexcept
 The name of mode id (0 is "default"), for labelling tokens.
 
std::vector< std::string > dfa_modes_active () const
 The modes actually accelerated by a DFA fast path.
 

Private Member Functions

bool scan_next (std::string_view source, position &cursor, std::vector< frame > &stack, token &out) const
 Advances cursor to and past the next non-skipped token in the active mode, applying the winning rule's mode transition (if any).
 
std::size_t intern_mode (const std::string &name)
 Interns a mode name to its id, assigning the next id on first sight.
 
void add_to_mode (std::size_t m, std::size_t idx)
 Adds rule idx to mode m's dispatch via REAL's exact first-byte API — the same 3-way split (nullable → general; one fixed byte → its bucket; otherwise → every bucket the set admits) as the mono-mode build.
 
bool mode_is_empty (std::size_t m) const
 Whether mode m has no active rule (so nothing can match in it).
 
void validate_transitions () const
 Fail-fast transition checks: a transition rule must consume input, and a push/set target must be a defined, non-empty mode.
 
void build_dispatch ()
 Builds the per-mode first-byte dispatch from rules_ (once, at construction). "default" is mode 0; every name in a rule's rule::in_mode and every push/set target is interned to an id. For each mode, the rules active in it are bucketed by REAL's exact first-byte API, keeping rules_ order so the index tie-break in scan_next is the rule priority. Finishes with validate_transitions.
 
void build_significance (const std::unordered_set< std::string > &insignificant_modes)
 Builds the layout-significance policy from the insignificant-mode names (validated against the interned modes). With none, the policy stays empty — every mode significant, so scilex::layout is the positional pass (invariant 1).
 
munch_result munch_at (std::size_t mode, std::string_view rest, unsigned char lead) const
 The winning munch in mode at the start of rest (lead is rest's first byte), dispatching to the mode's DFA when it has one, else the Pike + first-byte munch. The single match primitive both the forward scan and the error-recovery probe call, so the two never diverge on which rule wins.
 
bool may_start (std::size_t mode, unsigned char byte) const
 O(1) pre-filter for error recovery: can a fixed-lead rule in mode begin with byte? A false is conclusive (no rule can match, so the byte is error text); a true still needs a full munch_at to confirm a real match. This is what skips the bulk of noise cheaply.
 
bool starts_a_match (std::size_t mode, std::string_view source, std::size_t offset) const
 Does a rule in mode match at offset in source? The error-recovery loop's stop test — the smallest such offset ends an error run. Since recovery never runs in a mode with a nullable rule (see may_start), any match here has positive length, so have alone is the stop condition (a zero-length win is impossible in this context).
 
void advance (std::string_view source, position &cursor, std::size_t n) const
 Advances cursor by n bytes of source, maintaining the 1-based line/column tracker (a newline resets the column). The shared advance point for a matched token, a recovery step, and the error-run scan.
 
munch_result pike_munch_in_mode (std::size_t mode, std::string_view rest, unsigned char lead) const
 The per-rule Pike + first-byte-dispatch munch in mode at the start of rest (lead is rest's first byte). Zero allocation; shared by scan_next's Pike branch and the DFA audit. A zero-length match is reported as a candidate (it wins only when nothing matches >0); scan_next's shared guard turns such a win into a lexical error.
 
bool rule_active_in_mode (std::size_t idx, std::size_t mode) const
 Whether rule idx is active in mode mode (mirrors build_dispatch, an empty in_mode is the default mode only; otherwise the listed modes).
 
std::vector< std::string > audit_probes (const std::vector< std::size_t > &to_global) const
 The bounded, deterministic probe inputs for the audit: every active rule's possible first bytes ∪ structural bytes, as singletons and short repeats (repeats expose lazy delimiters and quantifier boundaries — the hard cases), then fixed-seed random strings. At most 512 inputs, each ≤ 48 bytes.
 
bool audit_passes (const real::dfa &candidate, const std::vector< std::size_t > &to_global, std::size_t mode) const
 The candidate DFA must reproduce the Pike munch on every probe: catches divergences the bytecode cannot reveal — chiefly a lazy quantifier, whose match() is the shortest span while the DFA takes the longest.
 
std::optional< mode_dfatry_build_mode_dfa (std::vector< std::size_t > to_global, std::size_t mode)
 Builds the mode_dfa for one mode, or std::nullopt if the mode cannot take the DFA fast path. Two non-error reasons return nullopt: the rules contain an un-DFA-able assertion (real::dfa throws real::dfa_error — caught here, turned into nullopt), or the DFA fails the build-time equivalence audit (a lazy quantifier). Both leave the mode on Pike. Returning the outcome instead of throwing past the caller keeps the fast-path decision explicit at the call site.
 
void build_dfa_modes (const std::unordered_set< std::string > &dfa_modes)
 Opts the named dfa_modes into the DFA fast path (called once, after build_dispatch). For each, builds a real::dfa from the mode's active rules in ascending global index (= priority); an un-DFA-able assertion (real::dfa_error) or a failed audit_passes leaves the mode on Pike (nullptr). Best-effort — see dfa_modes_active.
 

Static Private Member Functions

static std::string position_label (position where)
 Formats a position as "line:column" for diagnostics.
 
static std::size_t valid_utf8_len (std::string_view s, std::size_t off)
 The length (1–4) of a valid UTF-8 codepoint starting at off in s, or 0 when the byte there is not a valid lead — a continuation byte, a truncated/over­long/surrogate/ out-of-range sequence, or an invalid lead. The column stepper's UTF-8 validator.
 
static std::size_t column_step (std::string_view source, std::size_t off, scilex::column_unit unit)
 How much the column advances when the byte at off in source is consumed, under unit. bytes is always 1 (so the byte mode is the historical column == byte offset). codepoints counts one per valid codepoint (its lead scores 1, its continuations 0); utf16 scores 2 for an astral (4-byte) codepoint, 1 otherwise. A malformed byte — including an orphan continuation — scores 1 in every unit, so the column stays defined across the error runs recovery emits.
 

Private Attributes

std::vector< rulerules_
 The ordered token rules.
 
error_policy errors_
 What to do at an unmatched byte.
 
scilex::column_unit columns_
 The unit position::column is counted in.
 
std::vector< std::string > mode_names_
 Mode id -> name ("default" is id 0).
 
std::map< std::string, std::size_t > mode_id_
 Mode name -> id.
 
std::vector< dispatchper_mode_
 Dispatch index, one per mode (by id).
 
std::vector< std::shared_ptr< const mode_dfa > > per_mode_dfa_
 Per-mode DFA fast path (nullptr = Pike).
 
std::vector< bool > mode_significant_
 Layout policy (empty = all significant).
 

Friends

class token_iterator
 

Detailed Description

A lexer built from an ordered list of rules.

Order matters only as a tie-breaker between rules whose matches have equal length (the first such rule wins). Put more specific rules (keywords) before their general counterparts (identifiers).

Definition at line 235 of file lexer.hpp.

Constructor & Destructor Documentation

◆ lexer()

scilex::lexer::lexer ( std::vector< rule rules,
std::unordered_set< std::string >  insignificant_modes = {},
std::unordered_set< std::string >  dfa_modes = {},
error_policy  errors = error_policy::raise,
column_unit  columns = column_unit::bytes 
)
inlineexplicit

Builds a lexer from rules (taken by value, then moved in).

Parameters
[in]rulesThe ordered token rules.
[in]insignificant_modesModes whose tokens carry no layout structure (Layout Awareness Level A — see scilex::layout). Each name must be a mode the rules use; empty (the default) leaves every mode significant, so mode_significant has no effect.
[in]dfa_modesModes to accelerate with a real::dfa fast path (one DFA pass replaces the per-rule Pike dispatch). Each name must be a mode the rules use. Opt-in is best-effort: a mode whose rules cannot be a DFA (a zero-width assertion) or whose DFA fails the build-time audit (a lazy quantifier) silently stays on Pike — see dfa_modes_active. The token stream is identical either way.
[in]errorsWhat to do at a byte no rule can lex: error_policy::raise (the default — throw) or error_policy::token (recover, emitting an scilex::error token). The recovery path never throws per byte; the token stream under raise is unchanged.
[in]columnsThe unit each token's position::column is counted in: column_unit::bytes (the default, unchanged), column_unit::codepoints, or column_unit::utf16. The unit is not stored on the position — read it back with columns().
Exceptions
std::invalid_argumentIf a transition rule is malformed (empty pattern or target), or insignificant_modes / dfa_modes names an unknown mode.

Definition at line 263 of file lexer.hpp.

Member Function Documentation

◆ add_to_mode()

void scilex::lexer::add_to_mode ( std::size_t  m,
std::size_t  idx 
)
inlineprivate

Adds rule idx to mode m's dispatch via REAL's exact first-byte API — the same 3-way split (nullable → general; one fixed byte → its bucket; otherwise → every bucket the set admits) as the mono-mode build.

Definition at line 479 of file lexer.hpp.

◆ advance()

void scilex::lexer::advance ( std::string_view  source,
position cursor,
std::size_t  n 
) const
inlineprivate

Advances cursor by n bytes of source, maintaining the 1-based line/column tracker (a newline resets the column). The shared advance point for a matched token, a recovery step, and the error-run scan.

Definition at line 657 of file lexer.hpp.

◆ audit_passes()

bool scilex::lexer::audit_passes ( const real::dfa &  candidate,
const std::vector< std::size_t > &  to_global,
std::size_t  mode 
) const
inlineprivate

The candidate DFA must reproduce the Pike munch on every probe: catches divergences the bytecode cannot reveal — chiefly a lazy quantifier, whose match() is the shortest span while the DFA takes the longest.

Definition at line 864 of file lexer.hpp.

◆ audit_probes()

std::vector< std::string > scilex::lexer::audit_probes ( const std::vector< std::size_t > &  to_global) const
inlineprivate

The bounded, deterministic probe inputs for the audit: every active rule's possible first bytes ∪ structural bytes, as singletons and short repeats (repeats expose lazy delimiters and quantifier boundaries — the hard cases), then fixed-seed random strings. At most 512 inputs, each ≤ 48 bytes.

Definition at line 812 of file lexer.hpp.

◆ build_dfa_modes()

void scilex::lexer::build_dfa_modes ( const std::unordered_set< std::string > &  dfa_modes)
inlineprivate

Opts the named dfa_modes into the DFA fast path (called once, after build_dispatch). For each, builds a real::dfa from the mode's active rules in ascending global index (= priority); an un-DFA-able assertion (real::dfa_error) or a failed audit_passes leaves the mode on Pike (nullptr). Best-effort — see dfa_modes_active.

Parameters
[in]dfa_modesThe opted-in mode names. The build-time equivalence audit always runs; its outcome is observable via dfa_modes_active.
Exceptions
std::invalid_argumentIf dfa_modes names an unknown mode.

Definition at line 925 of file lexer.hpp.

◆ build_dispatch()

void scilex::lexer::build_dispatch ( )
inlineprivate

Builds the per-mode first-byte dispatch from rules_ (once, at construction). "default" is mode 0; every name in a rule's rule::in_mode and every push/set target is interned to an id. For each mode, the rules active in it are bucketed by REAL's exact first-byte API, keeping rules_ order so the index tie-break in scan_next is the rule priority. Finishes with validate_transitions.

Definition at line 536 of file lexer.hpp.

◆ build_significance()

void scilex::lexer::build_significance ( const std::unordered_set< std::string > &  insignificant_modes)
inlineprivate

Builds the layout-significance policy from the insignificant-mode names (validated against the interned modes). With none, the policy stays empty — every mode significant, so scilex::layout is the positional pass (invariant 1).

Definition at line 576 of file lexer.hpp.

◆ column_step()

static std::size_t scilex::lexer::column_step ( std::string_view  source,
std::size_t  off,
scilex::column_unit  unit 
)
inlinestaticprivate

How much the column advances when the byte at off in source is consumed, under unit. bytes is always 1 (so the byte mode is the historical column == byte offset). codepoints counts one per valid codepoint (its lead scores 1, its continuations 0); utf16 scores 2 for an astral (4-byte) codepoint, 1 otherwise. A malformed byte — including an orphan continuation — scores 1 in every unit, so the column stays defined across the error runs recovery emits.

Definition at line 723 of file lexer.hpp.

◆ columns()

column_unit scilex::lexer::columns ( ) const
inlinenoexcept

The unit this lexer counts position::column in (positions do not carry it, so a consumer that needs to interpret a column reads the unit here).

Definition at line 279 of file lexer.hpp.

◆ dfa_modes_active()

std::vector< std::string > scilex::lexer::dfa_modes_active ( ) const
inline

The modes actually accelerated by a DFA fast path.

A mode passed in dfa_modes but rejected — its rules need an assertion no DFA can represent (real::dfa_error), or its DFA failed the build-time audit (a lazy quantifier) — is absent here: it fell back to the Pike path, lexing the same tokens. So dfa_modes is an optimizer, not a guarantee; the rejected set is dfa_modes − dfa_modes_active().

Definition at line 357 of file lexer.hpp.

◆ intern_mode()

std::size_t scilex::lexer::intern_mode ( const std::string &  name)
inlineprivate

Interns a mode name to its id, assigning the next id on first sight.

Definition at line 467 of file lexer.hpp.

◆ may_start()

bool scilex::lexer::may_start ( std::size_t  mode,
unsigned char  byte 
) const
inlineprivate

O(1) pre-filter for error recovery: can a fixed-lead rule in mode begin with byte? A false is conclusive (no rule can match, so the byte is error text); a true still needs a full munch_at to confirm a real match. This is what skips the bulk of noise cheaply.

Only the first-byte buckets are consulted, not the mode's general (nullable) rules: a nullable rule matches the empty string at every position, so a mode that had one could never reach the no-match case (#1) that starts a recovery run — it would report a zero-length error (#4) at the very first byte instead. So during recovery the active mode provably has no general rule.

Definition at line 633 of file lexer.hpp.

◆ mode_is_empty()

bool scilex::lexer::mode_is_empty ( std::size_t  m) const
inlineprivate

Whether mode m has no active rule (so nothing can match in it).

Definition at line 500 of file lexer.hpp.

◆ mode_name()

const std::string & scilex::lexer::mode_name ( std::size_t  id) const
inlinenoexcept

The name of mode id (0 is "default"), for labelling tokens.

Definition at line 345 of file lexer.hpp.

◆ mode_significant()

const std::vector< bool > & scilex::lexer::mode_significant ( ) const
inlinenoexcept

The per-mode-id layout-significance policy (see scilex::layout). Index by a token's mode_id; false marks an insignificant mode. Empty unless the lexer was built with insignificant_modes.

Definition at line 339 of file lexer.hpp.

◆ munch_at()

munch_result scilex::lexer::munch_at ( std::size_t  mode,
std::string_view  rest,
unsigned char  lead 
) const
inlineprivate

The winning munch in mode at the start of rest (lead is rest's first byte), dispatching to the mode's DFA when it has one, else the Pike + first-byte munch. The single match primitive both the forward scan and the error-recovery probe call, so the two never diverge on which rule wins.

Definition at line 611 of file lexer.hpp.

◆ pike_munch_in_mode()

munch_result scilex::lexer::pike_munch_in_mode ( std::size_t  mode,
std::string_view  rest,
unsigned char  lead 
) const
inlineprivate

The per-rule Pike + first-byte-dispatch munch in mode at the start of rest (lead is rest's first byte). Zero allocation; shared by scan_next's Pike branch and the DFA audit. A zero-length match is reported as a candidate (it wins only when nothing matches >0); scan_next's shared guard turns such a win into a lexical error.

Definition at line 753 of file lexer.hpp.

◆ position_label()

static std::string scilex::lexer::position_label ( position  where)
inlinestaticprivate

Formats a position as "line:column" for diagnostics.

Definition at line 373 of file lexer.hpp.

◆ rule_active_in_mode()

bool scilex::lexer::rule_active_in_mode ( std::size_t  idx,
std::size_t  mode 
) const
inlineprivate

Whether rule idx is active in mode mode (mirrors build_dispatch, an empty in_mode is the default mode only; otherwise the listed modes).

Definition at line 793 of file lexer.hpp.

◆ scan() [1/2]

token_range scilex::lexer::scan ( std::string_view  source,
eof_policy  policy = eof_policy::omit 
) const &
inline

Returns a lazy range over the non-skipped tokens of source.

Each ++ produces the next token on demand; nothing but the current token is held. Usable in a range-for. Errors surface as lex_error thrown while advancing.

Parameters
[in]sourceThe text to scan (must outlive the iteration; each token's lexeme views into it).
[in]policyWhether to yield a terminal end_of_input token.
Returns
A token_range whose iterators yield token values.
Exceptions
lex_error(while iterating) if some position matches no rule.

Definition at line 1117 of file lexer.hpp.

◆ scan() [2/2]

token_range scilex::lexer::scan ( std::string_view  source,
eof_policy  policy = eof_policy::omit 
) const &&
delete

Deleted: the range would point into a temporary lexer.

◆ scan_next()

bool scilex::lexer::scan_next ( std::string_view  source,
position cursor,
std::vector< frame > &  stack,
token out 
) const
inlineprivate

Advances cursor to and past the next non-skipped token in the active mode, applying the winning rule's mode transition (if any).

Skips over skip-rule matches, then fills out with the next emitted token, advancing cursor (offset, line, byte column) and the mode stack past it. The single source of scanning truth shared by tokenize and the lazy token_iterator.

Parameters
[in]sourceThe text being scanned.
[in,out]cursorCurrent position; advanced past the consumed bytes.
[in,out]stackThe mode stack (its top is the active mode); a winning rule's transition mutates it. Never empty.
[out]outReceives the next non-skipped token on success.
Returns
true if a token was produced, false at end of input.
Exceptions
lex_errorIf a position matches no rule in the active mode (#1), a rule pops at the stack root (#2), input ends inside a pushed mode (#3), or the winning match is zero-length and so cannot advance the scan (#4).

Definition at line 397 of file lexer.hpp.

◆ starts_a_match()

bool scilex::lexer::starts_a_match ( std::size_t  mode,
std::string_view  source,
std::size_t  offset 
) const
inlineprivate

Does a rule in mode match at offset in source? The error-recovery loop's stop test — the smallest such offset ends an error run. Since recovery never runs in a mode with a nullable rule (see may_start), any match here has positive length, so have alone is the stop condition (a zero-length win is impossible in this context).

Definition at line 643 of file lexer.hpp.

◆ tokenize()

std::vector< token > scilex::lexer::tokenize ( std::string_view  source,
eof_policy  policy = eof_policy::omit 
) const
inline

Tokenizes source into the sequence of non-skipped tokens.

At each position every rule is matched anchored; the longest match wins (ties broken by rule order). A zero-length winning match (a nullable rule with no longer match here) cannot advance the scan, so it is reported as a lex_error rather than allowed to stall.

Parameters
[in]sourceThe text to tokenize (must outlive the returned tokens; each token's lexeme views into it).
[in]policyWhether to append a terminal end_of_input token.
Returns
The tokens in source order, skip-rule matches omitted.
Exceptions
lex_errorIf some position is matched by no rule, or only by a zero-length match.

Definition at line 299 of file lexer.hpp.

◆ try_build_mode_dfa()

std::optional< mode_dfa > scilex::lexer::try_build_mode_dfa ( std::vector< std::size_t >  to_global,
std::size_t  mode 
)
inlineprivate

Builds the mode_dfa for one mode, or std::nullopt if the mode cannot take the DFA fast path. Two non-error reasons return nullopt: the rules contain an un-DFA-able assertion (real::dfa throws real::dfa_error — caught here, turned into nullopt), or the DFA fails the build-time equivalence audit (a lazy quantifier). Both leave the mode on Pike. Returning the outcome instead of throwing past the caller keeps the fast-path decision explicit at the call site.

Parameters
[in]to_globalThe mode's active rules, ascending global index (priority).
[in]modeThe mode id (for the audit).

Definition at line 897 of file lexer.hpp.

◆ valid_utf8_len()

static std::size_t scilex::lexer::valid_utf8_len ( std::string_view  s,
std::size_t  off 
)
inlinestaticprivate

The length (1–4) of a valid UTF-8 codepoint starting at off in s, or 0 when the byte there is not a valid lead — a continuation byte, a truncated/over­long/surrogate/ out-of-range sequence, or an invalid lead. The column stepper's UTF-8 validator.

Definition at line 676 of file lexer.hpp.

◆ validate_transitions()

void scilex::lexer::validate_transitions ( ) const
inlineprivate

Fail-fast transition checks: a transition rule must consume input, and a push/set target must be a defined, non-empty mode.

Exceptions
std::invalid_argumenton a violation.

Definition at line 511 of file lexer.hpp.

Friends And Related Symbol Documentation

◆ token_iterator

friend class token_iterator
friend

Definition at line 370 of file lexer.hpp.

Member Data Documentation

◆ columns_

scilex::column_unit scilex::lexer::columns_
private

The unit position::column is counted in.

Definition at line 955 of file lexer.hpp.

◆ errors_

error_policy scilex::lexer::errors_
private

What to do at an unmatched byte.

Definition at line 954 of file lexer.hpp.

◆ mode_id_

std::map<std::string, std::size_t> scilex::lexer::mode_id_
private

Mode name -> id.

Definition at line 957 of file lexer.hpp.

◆ mode_names_

std::vector<std::string> scilex::lexer::mode_names_
private

Mode id -> name ("default" is id 0).

Definition at line 956 of file lexer.hpp.

◆ mode_significant_

std::vector<bool> scilex::lexer::mode_significant_
private

Layout policy (empty = all significant).

Definition at line 960 of file lexer.hpp.

◆ per_mode_

std::vector<dispatch> scilex::lexer::per_mode_
private

Dispatch index, one per mode (by id).

Definition at line 958 of file lexer.hpp.

◆ per_mode_dfa_

std::vector<std::shared_ptr<const mode_dfa> > scilex::lexer::per_mode_dfa_
private

Per-mode DFA fast path (nullptr = Pike).

Definition at line 959 of file lexer.hpp.

◆ rules_

std::vector<rule> scilex::lexer::rules_
private

The ordered token rules.

Definition at line 953 of file lexer.hpp.


The documentation for this class was generated from the following file: