The lexer: maximal-munch tokenization over a set of REAL patterns.

#include <algorithm>
#include <array>
#include <iterator>
#include <map>
#include <memory>
#include <optional>
#include <random>
#include <span>
#include <stdexcept>
#include <string>
#include <string_view>
#include <tuple>
#include <unordered_set>
#include <utility>
#include <vector>
#include <real/dfa.hpp>
#include <real/real.hpp>
#include "token.hpp"

Classes
struct	scilex::mode_action
	A mode transition, fired when its rule wins, acting on the scan's mode stack: enter a nested mode, leave the current one, or replace it.

struct	scilex::rule
	A token rule: a kind, the pattern that recognizes it, whether matches are discarded (whitespace, comments), and, for contextual lexing, the modes it is active in and an optional mode transition it fires when it wins.

class	scilex::lex_error
	Thrown when no rule matches at a position (a lexical error).

struct	scilex::frame
	One entry on the per-scan mode stack: the active mode and the position where it was entered (the entry position feeds the unterminated/diagnostic messages).

class	scilex::lexer
	A lexer built from an ordered list of rules.

struct	scilex::lexer::mode_dfa
	An adopted per-mode DFA: the automaton plus its local→global rule map.

struct	scilex::lexer::munch_result
	A munch decision: whether a rule matched, which one (by global index), and how many bytes it consumed. This is the small value that scan_next's Pike branch and the audit share.

struct	scilex::lexer::dispatch
	Per-mode dispatch index: the first-byte buckets scoped to one mode.

class	scilex::token_iterator
	Forward (single-pass) iterator yielding one token at a time.

class	scilex::token_range
	A lazy range of tokens, returned by lexer::scan.

Namespaces
namespace	scilex
	The SciLex public API (scilex::lexer, scilex::rule, scilex::token).

Enumerations
enum class	scilex::eof_policy { scilex::omit , scilex::append }
	Whether tokenization appends a synthetic end-of-input token.

enum class	scilex::error_policy { scilex::raise , scilex::token }
	What a lexer does when it reaches a byte that no rule in the active mode can begin.

enum class	scilex::column_unit { scilex::bytes , scilex::codepoints , scilex::utf16 }
	The unit a token's `position::column` is counted in.
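
To make the three column units concrete, here is a minimal sketch of how a column could be counted over a UTF-8 line in each unit. It is not SciLex's implementation: the local `column_unit` enum only mirrors the enumerator names listed above, `column_of` is a hypothetical helper, and the sketch assumes valid UTF-8 input, an offset on a code point boundary, and a zero-based column (the number of units that precede the offset).

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Local stand-in that mirrors the enumerator names of scilex::column_unit.
enum class column_unit { bytes, codepoints, utf16 };

// Hypothetical helper: counts how many units of `unit` precede byte `offset`
// in `line`. Assumes `line` is valid UTF-8 and `offset` is on a code point
// boundary.
std::size_t column_of(std::string_view line, std::size_t offset, column_unit unit) {
    assert(offset <= line.size());
    std::size_t col = 0;
    for (std::size_t i = 0; i < offset; ++i) {
        const unsigned char b = static_cast<unsigned char>(line[i]);
        const bool continuation = (b & 0xC0) == 0x80;  // 10xxxxxx bytes
        switch (unit) {
        case column_unit::bytes:
            ++col;  // every byte counts
            break;
        case column_unit::codepoints:
            if (!continuation) ++col;  // count only lead bytes
            break;
        case column_unit::utf16:
            // A 4-byte UTF-8 sequence (lead byte >= 0xF0) encodes a code point
            // above U+FFFF, which takes a surrogate pair (2 units) in UTF-16.
            if (!continuation) col += (b >= 0xF0) ? 2 : 1;
            break;
        }
    }
    return col;
}
```

The three units agree on pure ASCII input and diverge as soon as a line contains multi-byte characters, which is why an editor integration that expects UTF-16 columns cannot reuse byte columns directly.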

Functions
void	scilex::apply_transition (const rule &r, position start, std::vector< frame > &stack)
	Applies rule `r`'s mode transition (if any) to `stack`. This is the per-scan mode-stack mutation, kept pure so the lexer and the fuzz oracle share it verbatim.
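
The mode-stack mutation can be pictured with the following sketch. Only the function signature and the three kinds of transition (enter a nested mode, leave the current one, replace it) come from this page; every type layout below (`position`, `mode_action`, `rule`, `frame` and their members) is a simplified, hypothetical stand-in, and refusing to pop the last frame is an assumption rather than documented behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the scilex types of the same names.
struct position { std::size_t offset = 0; };

struct mode_action {
    enum class op { push, pop, replace } kind;
    std::string target;  // mode to enter or switch to; unused for pop
};

struct rule {
    std::string kind;
    std::optional<mode_action> action;  // empty: the rule fires no transition
};

struct frame {
    std::string mode;   // the active mode
    position entered;   // where it was entered (useful for diagnostics)
};

// Sketch of the transition step: mutates only `stack`, no other state.
void apply_transition(const rule& r, position start, std::vector<frame>& stack) {
    if (!r.action) return;   // rule has no transition
    assert(!stack.empty());  // a scan always has a current mode
    switch (r.action->kind) {
    case mode_action::op::push:
        stack.push_back(frame{r.action->target, start});
        break;
    case mode_action::op::pop:
        // Assumption: the root mode is never popped.
        if (stack.size() > 1) stack.pop_back();
        break;
    case mode_action::op::replace:
        stack.back() = frame{r.action->target, start};
        break;
    }
}
```

Because the function touches nothing but its arguments, a test oracle can replay the same sequence of winning rules and compare stacks without constructing a lexer.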

Detailed Description

The lexer: maximal-munch tokenization over a set of REAL patterns.

SciLex is a thin layer over REAL. Each scilex::rule pairs a token kind with a real::regex; the lexer scans the source left to right and, at each position, picks the rule with the longest anchored match (maximal munch), breaking ties by rule order (earlier rules have priority). Because REAL is a linear-time engine, tokenization is linear and ReDoS-safe by construction: no token rule can make the scanner backtrack catastrophically.
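
The selection rule described above (longest anchored match, earlier rule wins a tie) can be sketched in a few lines. This is an illustration only: it uses `std::regex` as a stand-in for REAL, and `std::regex` is a backtracking engine, so the sketch does not inherit the linear-time guarantee. The names `toy_rule`, `munch`, and `longest_match` are hypothetical, and skipping empty matches is an assumption made so that a scanner built on this step always advances.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for a token rule: a kind plus its pattern.
struct toy_rule {
    std::string kind;
    std::regex pattern;
};

// The winning rule's index and the number of bytes it consumed.
struct munch {
    std::size_t rule_index;
    std::size_t length;
};

// Picks the rule with the longest match anchored at `pos`.
// Ties go to the earlier rule because only a strictly longer match replaces
// the current best.
std::optional<munch> longest_match(const std::vector<toy_rule>& rules,
                                   const std::string& src, std::size_t pos) {
    assert(pos <= src.size());
    std::optional<munch> best;
    const auto first = src.begin() + static_cast<std::ptrdiff_t>(pos);
    for (std::size_t i = 0; i < rules.size(); ++i) {
        std::smatch m;
        // match_continuous anchors the match at `first`.
        const bool ok = std::regex_search(first, src.end(), m, rules[i].pattern,
                                          std::regex_constants::match_continuous);
        if (!ok || m.length(0) == 0) continue;  // assumption: ignore empty matches
        const auto len = static_cast<std::size_t>(m.length(0));
        if (!best || len > best->length) best = munch{i, len};
    }
    return best;  // empty: no rule matches here (a lexical error)
}
```

With a keyword rule `if` listed before an identifier rule `[a-z]+`, the input `iffy` is claimed by the identifier rule (four bytes beat two), while the input `if` is claimed by the keyword rule (equal lengths, earlier rule wins).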

There are two ways to consume tokens: scilex::lexer::tokenize materializes them all into a vector, while scilex::lexer::scan returns a lazy range that produces one token at a time (the parser-friendly access pattern: no token vector is allocated).
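
The difference between the two access patterns can be shown with a toy scanner that splits on whitespace. Nothing here is the SciLex API: `word`, `word_range`, and this `tokenize` are hypothetical names, and the iterator is a bare-bones single-pass iterator that supports only what a range-based `for` loop needs.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Toy token: a run of non-space bytes (a real token also has a kind and position).
struct word { std::string_view text; };

// Lazy range: each increment scans just far enough to produce the next word.
class word_range {
public:
    explicit word_range(std::string_view src) : src_(src) {}

    class iterator {
    public:
        iterator() : done_(true) {}  // end sentinel
        explicit iterator(std::string_view src) : src_(src) { advance(); }
        const word& operator*() const { assert(!done_); return current_; }
        iterator& operator++() { advance(); return *this; }
        // Single-pass: iterators only need to be compared against end().
        bool operator!=(const iterator& other) const { return done_ != other.done_; }
    private:
        static bool space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
        void advance() {
            while (pos_ < src_.size() && space(src_[pos_])) ++pos_;
            if (pos_ == src_.size()) { done_ = true; return; }
            const std::size_t start = pos_;
            while (pos_ < src_.size() && !space(src_[pos_])) ++pos_;
            current_ = word{src_.substr(start, pos_ - start)};
        }
        std::string_view src_;
        std::size_t pos_ = 0;
        word current_{};
        bool done_ = false;
    };

    iterator begin() const { return iterator(src_); }
    iterator end() const { return iterator(); }
private:
    std::string_view src_;
};

// Eager counterpart: drains the lazy range into a vector.
std::vector<word> tokenize(std::string_view src) {
    std::vector<word> out;
    for (const word& w : word_range(src)) out.push_back(w);
    return out;
}
```

A parser that needs only the current token and perhaps one token of lookahead can iterate the range directly, whereas a tool that needs random access to the token sequence would likely prefer the vector form.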

Definition in file lexer.hpp.
