SciLex
A header-only C++20 lexer built on REAL
The lexer: maximal-munch tokenization over a set of REAL patterns.
#include <algorithm>
#include <array>
#include <iterator>
#include <map>
#include <memory>
#include <optional>
#include <random>
#include <span>
#include <stdexcept>
#include <string>
#include <string_view>
#include <tuple>
#include <unordered_set>
#include <utility>
#include <vector>
#include <real/dfa.hpp>
#include <real/real.hpp>
#include "token.hpp"
Classes | |
| struct | scilex::mode_action |
| A mode transition, fired when its rule wins, that acts on the scan's mode stack: enter a nested mode, leave the current one, or replace it. | |
| struct | scilex::rule |
| A token rule: a kind, the pattern that recognizes it, and whether matches are discarded (whitespace, comments). For contextual lexing it also carries the modes it is active in and an optional mode transition it fires when it wins. | |
| class | scilex::lex_error |
| Thrown when no rule matches at a position (a lexical error). | |
| struct | scilex::frame |
| One entry on the per-scan mode stack: the active mode and the position where it was entered (the entry position feeds the unterminated/diagnostic messages). | |
| class | scilex::lexer |
| A lexer built from an ordered list of rules. | |
| struct | scilex::lexer::mode_dfa |
| An adopted per-mode DFA: the automaton plus its local→global rule map. | |
| struct | scilex::lexer::munch_result |
| A munch decision: whether a rule matched, which one (by global index), and how many bytes it consumed. This small value is shared by scan_next's Pike branch and the audit. | |
| struct | scilex::lexer::dispatch |
| Per-mode dispatch index: the first-byte buckets scoped to one mode. | |
| class | scilex::token_iterator |
| Forward (single-pass) iterator yielding one token at a time. | |
| class | scilex::token_range |
| A lazy range of tokens, returned by lexer::scan. | |
Namespaces | |
| namespace | scilex |
| The SciLex public API (scilex::lexer, scilex::rule, scilex::token). | |
Enumerations | |
| enum class | scilex::eof_policy { scilex::omit, scilex::append } |
| Whether tokenization appends a synthetic end-of-input token. | |
| enum class | scilex::error_policy { scilex::raise, scilex::token } |
| What a lexer does when it reaches a byte that no rule in the active mode can begin. | |
| enum class | scilex::column_unit { scilex::bytes, scilex::codepoints, scilex::utf16 } |
| The unit a token's position::column is counted in. | |
Functions | |
| void | scilex::apply_transition (const rule &r, position start, std::vector< frame > &stack) |
| Applies rule r's mode transition (if any) to stack. This is the per-scan mode-stack mutation, kept pure so the lexer and the fuzz oracle share it verbatim. | |
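The three transitions named above (enter, leave, replace) can be pictured with a small stand-alone sketch. It does not use the SciLex types: toy_op, toy_action, toy_frame, and toy_apply are hypothetical names introduced only for illustration, modes are plain strings, and positions are byte offsets. The handling of a leave on a one-frame stack (ignored here) is an assumption, not documented behavior of scilex::apply_transition.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for scilex::mode_action and scilex::frame.
enum class toy_op { enter, leave, replace };

struct toy_action {
    toy_op op;
    std::string target;  // mode to enter or switch to; unused for leave
};

struct toy_frame {
    std::string mode;        // the active mode
    std::size_t entered_at;  // where the mode was entered (byte offset here)
};

// Mutates only `stack`, so a lexer and a test oracle could call it identically.
inline void toy_apply(const std::optional<toy_action>& action,
                      std::size_t start,
                      std::vector<toy_frame>& stack) {
    if (!action) return;  // the winning rule carries no transition
    switch (action->op) {
    case toy_op::enter:
        // Push a nested mode on top of the current one.
        stack.push_back(toy_frame{action->target, start});
        break;
    case toy_op::leave:
        // Assumption: the bottom (base) frame is never popped.
        if (stack.size() > 1) stack.pop_back();
        break;
    case toy_op::replace:
        // Swap the current mode without changing the stack depth.
        if (!stack.empty()) stack.back() = toy_frame{action->target, start};
        break;
    }
}
```

Starting from a one-frame stack in a "default" mode, an enter at offset 5 deepens the stack to two frames and records 5 as the entry position, which is the kind of information an unterminated-construct diagnostic would report.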
SciLex is a thin layer over REAL. Each scilex::rule pairs a token kind with a real::regex; the lexer scans the source left to right, and at each position picks the rule with the longest anchored match (maximal munch), breaking ties by rule order (earlier rules have priority). Because REAL is a linear-time engine, tokenization is linear and ReDoS-safe by construction — no token rule can make the scanner backtrack catastrophically.
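The selection rule described above (longest anchored match, earlier rule wins a tie) can be sketched independently of SciLex. The example below is a minimal illustration under stated assumptions: sketch_rule, sketch_match, and munch are hypothetical names, and std::regex stands in for real::regex. std::regex is a backtracking engine, so this sketch does not carry REAL's linear-time guarantee; it only shows the tie-breaking logic.

```cpp
#include <cstddef>
#include <optional>
#include <regex>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for scilex::rule: a kind name plus a pattern.
struct sketch_rule {
    std::string kind;
    std::regex pattern;  // stand-in for real::regex
};

// Which rule won (index into the rule list) and how many bytes it consumed.
struct sketch_match {
    std::size_t rule_index;
    std::size_t length;
};

// Returns the rule with the longest match anchored at `pos`, or nullopt if
// no rule matches there. The strict '>' comparison keeps the earliest rule
// when two rules match the same length.
inline std::optional<sketch_match> munch(const std::vector<sketch_rule>& rules,
                                         std::string_view src,
                                         std::size_t pos) {
    std::optional<sketch_match> best;
    const char* first = src.data() + pos;
    const char* last = src.data() + src.size();
    for (std::size_t i = 0; i < rules.size(); ++i) {
        std::cmatch m;
        // match_continuous anchors the match at `first`.
        if (std::regex_search(first, last, m, rules[i].pattern,
                              std::regex_constants::match_continuous)) {
            const auto len = static_cast<std::size_t>(m.length(0));
            // Empty matches are ignored so a scan loop always makes progress.
            if (len > 0 && (!best || len > best->length)) {
                best = sketch_match{i, len};
            }
        }
    }
    return best;
}
```

With a keyword rule `if` listed before an identifier rule `[a-z]+`, the input `iffy` goes to the identifier rule because its match is longer, while the input `if` matches both rules at length 2 and goes to the keyword rule because it comes first.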
Two ways to consume tokens: scilex::lexer::tokenize materializes them all into a vector, while scilex::lexer::scan returns a lazy range that produces one token at a time (the parser-friendly access pattern — no token vector is allocated).
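The difference between the two access patterns can be illustrated with a toy that has nothing to do with the SciLex implementation. In this sketch, toy_token, toy_range, and collect are hypothetical names, and "tokens" are simply space-separated words. The range produces each token only when the iterator advances, while collect drains the same range into a vector, which is roughly the relationship the text describes between scan and tokenize.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical toy token: a slice of the source text.
struct toy_token {
    std::string_view text;
};

// A lazy, single-pass range: each token is found only when the iterator
// is constructed or incremented. No token vector is built.
class toy_range {
public:
    struct sentinel {};

    class iterator {
    public:
        explicit iterator(std::string_view rest) : rest_(rest) { advance(); }
        const toy_token& operator*() const { return current_; }
        iterator& operator++() { advance(); return *this; }
        bool operator!=(sentinel) const { return !done_; }

    private:
        void advance() {
            // Skip separators, then slice out the next word.
            while (!rest_.empty() && rest_.front() == ' ') rest_.remove_prefix(1);
            if (rest_.empty()) { done_ = true; return; }
            std::size_t n = rest_.find(' ');
            if (n == std::string_view::npos) n = rest_.size();
            current_ = toy_token{rest_.substr(0, n)};
            rest_.remove_prefix(n);
        }

        std::string_view rest_;
        toy_token current_{};
        bool done_ = false;
    };

    explicit toy_range(std::string_view src) : src_(src) {}
    iterator begin() const { return iterator{src_}; }
    sentinel end() const { return {}; }

private:
    std::string_view src_;
};

// The eager counterpart: drain the lazy range into a vector.
inline std::vector<toy_token> collect(std::string_view src) {
    std::vector<toy_token> out;
    for (const toy_token& t : toy_range{src}) out.push_back(t);
    return out;
}
```

A parser that only ever looks at the current token can loop over the range directly and stop early, whereas a caller that needs random access or multiple passes would pay for the vector once.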
Definition in file lexer.hpp.