SciLex
A header-only C++20 lexer built on REAL
lexer.hpp File Reference

The lexer: maximal-munch tokenization over a set of REAL patterns.

#include <algorithm>
#include <array>
#include <iterator>
#include <map>
#include <memory>
#include <optional>
#include <random>
#include <span>
#include <stdexcept>
#include <string>
#include <string_view>
#include <tuple>
#include <unordered_set>
#include <utility>
#include <vector>
#include <real/dfa.hpp>
#include <real/real.hpp>
#include "token.hpp"

Classes

struct  scilex::mode_action
 A mode transition, fired when its rule wins, acting on the scan's mode stack: enter a nested mode, leave the current one, or replace it.
 
struct  scilex::rule
 A token rule: a kind, the pattern that recognizes it, whether matches are discarded (whitespace, comments), and, for contextual lexing, the modes it is active in and an optional mode transition it fires when it wins.
 
class  scilex::lex_error
 Thrown when no rule matches at a position (a lexical error).
 
struct  scilex::frame
 One entry on the per-scan mode stack: the active mode and where it was entered (the entry position feeds the unterminated/diagnostic messages).
 
class  scilex::lexer
 A lexer built from an ordered list of rules.
 
struct  scilex::lexer::mode_dfa
 An adopted per-mode DFA: the automaton plus its local→global rule map.
 
struct  scilex::lexer::munch_result
 A munch decision: whether a rule matched, which one (global index), and how many bytes. This is the small value that scan_next's Pike branch and the audit share.
 
struct  scilex::lexer::dispatch
 Per-mode dispatch index: the first-byte buckets scoped to one mode.
 
class  scilex::token_iterator
 Forward (single-pass) iterator yielding one token at a time.
 
class  scilex::token_range
 A lazy range of tokens, returned by lexer::scan.
 

Namespaces

namespace  scilex
 The SciLex public API (scilex::lexer, scilex::rule, scilex::token).
 

Enumerations

enum class  scilex::eof_policy { omit, append }
 Whether tokenization appends a synthetic end-of-input token.
 
enum class  scilex::error_policy { raise, token }
 What a lexer does when it reaches a byte that no rule in the active mode can begin.
 
enum class  scilex::column_unit { bytes, codepoints, utf16 }
 The unit a token's position::column is counted in.
 

Functions

void scilex::apply_transition (const rule &r, position start, std::vector< frame > &stack)
 Applies rule r's mode transition (if any) to stack — the per-scan mode-stack mutation, kept pure so the lexer and the fuzz oracle share it verbatim.
 

Detailed Description

The lexer: maximal-munch tokenization over a set of REAL patterns.

SciLex is a thin layer over REAL. Each scilex::rule pairs a token kind with a real::regex; the lexer scans the source left to right, and at each position picks the rule with the longest anchored match (maximal munch), breaking ties by rule order (earlier rules have priority). Because REAL is a linear-time engine, tokenization is linear and ReDoS-safe by construction — no token rule can make the scanner backtrack catastrophically.
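The selection rule described above (longest anchored match, ties broken by rule order) can be sketched in isolation. The example below is a minimal illustration, not SciLex code: toy_rule, toy_token, and toy_tokenize are hypothetical names, and std::regex stands in for real::regex purely so that the sketch is self-contained. std::regex is a backtracking engine, so this sketch does not carry the linear-time guarantee that REAL provides.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for scilex::rule and scilex::token (not the library's types).
struct toy_rule {
    std::string kind;    // token kind
    std::regex pattern;  // std::regex used here in place of real::regex
    bool skip = false;   // discard matches (whitespace, comments)
};

struct toy_token {
    std::string kind;
    std::string text;
};

// Maximal munch: at each position, try every rule anchored at that position,
// keep the longest match, and let the earliest rule win a tie.
std::vector<toy_token> toy_tokenize(const std::vector<toy_rule>& rules,
                                    const std::string& src) {
    std::vector<toy_token> out;
    std::size_t pos = 0;
    while (pos < src.size()) {
        const toy_rule* best = nullptr;
        std::size_t best_len = 0;
        for (const auto& r : rules) {
            std::smatch m;
            // match_continuous anchors the match at the current position.
            if (std::regex_search(src.begin() + static_cast<std::ptrdiff_t>(pos),
                                  src.end(), m, r.pattern,
                                  std::regex_constants::match_continuous)) {
                const auto len = static_cast<std::size_t>(m.length(0));
                // Strict '>' keeps the earlier rule when lengths are equal.
                if (len > best_len) {
                    best = &r;
                    best_len = len;
                }
            }
        }
        if (best == nullptr) {
            throw std::runtime_error("no rule matches at offset " + std::to_string(pos));
        }
        assert(best_len > 0);  // empty matches are never selected, so the scan advances
        if (!best->skip) {
            out.push_back({best->kind, src.substr(pos, best_len)});
        }
        pos += best_len;
    }
    return out;
}
```

With rules listed in the order keyword "if", identifier "[a-z]+", whitespace " +" (skipped), the input "if iffy" produces a keyword token for "if" (a length tie, won by the earlier rule) followed by an identifier token for "iffy" (the longer match wins).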

Two ways to consume tokens: scilex::lexer::tokenize materializes them all into a vector, while scilex::lexer::scan returns a lazy range that produces one token at a time (the parser-friendly access pattern — no token vector is allocated).
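The difference between the two access patterns can be sketched without the library. The example below is a hedged illustration only: toy_scanner and toy_tokenize_all are hypothetical names, and the scanner splits on whitespace rather than running token rules. It does not reproduce the signatures of scilex::lexer::scan or scilex::lexer::tokenize, only the contrast between pulling one token at a time and materializing a vector.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical pull-style scanner: the same access pattern as a lazy token
// range, but splitting on whitespace instead of matching token rules.
class toy_scanner {
public:
    explicit toy_scanner(std::string_view src) : src_(src) {}

    // Produces the next token on demand, or std::nullopt at end of input.
    // Nothing is stored between calls except the current offset.
    std::optional<std::string_view> next() {
        while (pos_ < src_.size() && is_space(src_[pos_])) ++pos_;
        if (pos_ == src_.size()) return std::nullopt;
        const std::size_t start = pos_;
        while (pos_ < src_.size() && !is_space(src_[pos_])) ++pos_;
        return src_.substr(start, pos_ - start);
    }

private:
    static bool is_space(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }

    std::string_view src_;
    std::size_t pos_ = 0;
};

// Eager counterpart: drains the scanner into a vector, which costs one
// allocation-backed container holding every token at once.
std::vector<std::string_view> toy_tokenize_all(std::string_view src) {
    std::vector<std::string_view> out;
    toy_scanner scanner(src);
    while (auto tok = scanner.next()) {
        out.push_back(*tok);
    }
    return out;
}
```

A parser that only ever looks at the current token can call next() in its own loop and stop early on an error, while a tool that needs random access to the token stream is better served by the materialized vector.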

Definition in file lexer.hpp.