SciLex
A header-only C++20 lexer built on REAL
scilex Namespace Reference

The SciLex public API (scilex::lexer, scilex::rule, scilex::token).

Classes

struct  frame
 One entry on the per-scan mode stack: the active mode and where it was entered (the entry position feeds the unterminated/diagnostic messages).

class  layout_error
 Thrown when a line's indentation matches no enclosing level.

class  lex_error
 Thrown when no rule matches at a position (a lexical error).

class  lexer
 A lexer built from an ordered list of rules.

struct  mode_action
 A mode transition, fired when its rule wins, acting on the scan's mode stack: enter a nested mode, leave the current one, or replace it.

struct  position
 A location in the source text.

struct  rule
 A token rule: a kind, the pattern that recognizes it, whether matches are discarded (whitespace, comments), and, for contextual lexing, the modes it is active in and an optional mode transition it fires when it wins.

struct  token
 One lexical token: a typed slice of the source.

class  token_iterator
 Forward (single-pass) iterator yielding one token at a time.

class  token_range
 A lazy range of tokens, returned by lexer::scan.
 

Enumerations

enum class  eof_policy { omit , append }
 Whether tokenization appends a synthetic end-of-input token.

enum class  error_policy { raise , token }
 What a lexer does when it reaches a byte that no rule in the active mode can begin.

enum class  column_unit { bytes , codepoints , utf16 }
 The unit a token's position::column is counted in.
 

Functions

std::vector< token > layout (std::span< const token > tokens, const std::vector< bool > &mode_significant={})
 Rewrites tokens with NEWLINE / INDENT / DEDENT inserted.
 
void apply_transition (const rule &r, position start, std::vector< frame > &stack)
 Applies rule r's mode transition (if any) to stack — the per-scan mode-stack mutation, kept pure so the lexer and the fuzz oracle share it verbatim.
 

Variables

constexpr int newline {std::numeric_limits<int>::min() + 1}
 Reserved kind: end of a logical line.
 
constexpr int indent {std::numeric_limits<int>::min() + 2}
 Reserved kind: indentation increased (start of a deeper block).
 
constexpr int dedent {std::numeric_limits<int>::min() + 3}
 Reserved kind: indentation decreased (end of a block).
 
constexpr int end_of_input {std::numeric_limits<int>::min()}
 Reserved token kind for the synthetic end-of-input token.
 
constexpr int error {std::numeric_limits<int>::min() + 4}
 Reserved token kind for a lexical-error run under scilex::error_policy::token.
 

Detailed Description

The SciLex public API (scilex::lexer, scilex::rule, scilex::token).

Enumeration Type Documentation

◆ column_unit

enum class scilex::column_unit
strong

The unit a token's position::column is counted in.

The default, bytes, preserves the historical behaviour bit-for-bit. codepoints counts Unicode scalar values (each valid UTF-8 codepoint is one column), and utf16 counts UTF-16 code units (a BMP codepoint is 1, an astral codepoint is 2), which is the unit an LSP client expects. A malformed byte (an orphan continuation, an overlong or out-of-range sequence) counts as one unit in every mode, so the column stays defined on the error runs that error_policy::token emits. The chosen unit is not carried on position, whose column is a single field and not self-describing; the lexer declares it via lexer::columns instead, a named trade-off rather than a silent default.

Enumerator
bytes 

One column per byte (the default; column == byte offset within the line + 1).

codepoints 

One column per Unicode scalar value (a valid UTF-8 codepoint).

utf16 

One column per UTF-16 code unit (BMP = 1, astral = 2) — the LSP unit.

Definition at line 221 of file lexer.hpp.
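
The three units can be illustrated with a standalone sketch that does not use SciLex itself. The names column_of and utf8_length are hypothetical, the local column_unit enum only mirrors the enumerator names above, and the choice to advance one byte at a time through a malformed sequence is an assumption about how "one unit per malformed byte" plays out.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for scilex::column_unit (same enumerator names).
enum class column_unit { bytes, codepoints, utf16 };

// Byte length of the well-formed UTF-8 sequence starting at s[i], or 0 if the
// byte at s[i] is malformed (orphan continuation, overlong form, surrogate,
// out-of-range lead, or truncated sequence).
inline std::size_t utf8_length(std::string_view s, std::size_t i) {
    const auto b = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char lead = b(i);
    if (lead < 0x80) return 1;
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
    if (lead >= 0xC2 && lead <= 0xDF) {
        len = 2;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
        len = 3;
        if (lead == 0xE0) lo = 0xA0;  // rejects overlong 3-byte forms
        if (lead == 0xED) hi = 0x9F;  // rejects UTF-16 surrogates
    } else if (lead >= 0xF0 && lead <= 0xF4) {
        len = 4;
        if (lead == 0xF0) lo = 0x90;  // rejects overlong 4-byte forms
        if (lead == 0xF4) hi = 0x8F;  // rejects values above U+10FFFF
    } else {
        return 0;
    }
    if (i + len > s.size()) return 0;
    if (b(i + 1) < lo || b(i + 1) > hi) return 0;
    for (std::size_t k = 2; k < len; ++k)
        if (b(i + k) < 0x80 || b(i + k) > 0xBF) return 0;
    return len;
}

// 1-based column of byte offset `offset` within `line`, in the given unit.
inline std::size_t column_of(std::string_view line, std::size_t offset, column_unit unit) {
    assert(offset <= line.size());
    if (unit == column_unit::bytes) return offset + 1;
    std::size_t column = 1;
    for (std::size_t i = 0; i < offset;) {
        const std::size_t len = utf8_length(line, i);
        if (len == 0) {  // malformed byte: one unit in every mode
            column += 1;
            i += 1;
            continue;
        }
        // A 4-byte sequence is an astral codepoint: two UTF-16 code units.
        column += (unit == column_unit::utf16 && len == 4) ? 2 : 1;
        i += len;
    }
    return column;
}
```

For the line "aé😀z" (8 bytes), the final z sits at byte offset 7, which this sketch reports as column 8 in bytes, 4 in codepoints, and 5 in utf16.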

◆ eof_policy

enum class scilex::eof_policy
strong

Whether tokenization appends a synthetic end-of-input token.

eof_policy::append yields one final token of kind end_of_input at the end position once the input is exhausted — the parser-friendly mode, so a cursor always has a current token to match against.

Enumerator
omit 

Stop at the last real token (default).

append 

Append one end_of_input token at the end position.

Definition at line 53 of file lexer.hpp.
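
Why a terminal token is parser-friendly can be seen in a small, self-contained sketch. The tok, cursor, and count_leading names are hypothetical and are not part of SciLex; only the sentinel value mirrors scilex::end_of_input.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Mirrors the reserved value of scilex::end_of_input; everything else here
// is a hypothetical stand-in, not the SciLex token type.
constexpr int end_of_input = std::numeric_limits<int>::min();

struct tok {
    int kind;
};

// A parser cursor over a sentinel-terminated sequence. Because the last token
// is always end_of_input, peek() never needs a bounds check.
class cursor {
public:
    explicit cursor(const std::vector<tok>& tokens) : tokens_(tokens) {
        assert(!tokens_.empty() && tokens_.back().kind == end_of_input);
    }

    const tok& peek() const { return tokens_[index_]; }
    bool at_end() const { return peek().kind == end_of_input; }

    // Consumes the current token if it has the given kind. The sentinel is
    // never consumed, so the cursor cannot step past the end.
    bool accept(int kind) {
        if (at_end() || peek().kind != kind) return false;
        ++index_;
        return true;
    }

private:
    const std::vector<tok>& tokens_;
    std::size_t index_ = 0;
};

// Counts the leading run of tokens of kind `kind`.
inline std::size_t count_leading(const std::vector<tok>& tokens, int kind) {
    cursor c(tokens);
    std::size_t n = 0;
    while (c.accept(kind)) ++n;
    return n;
}
```

With eof_policy::omit, the same loop would need an explicit end check before every peek.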

◆ error_policy

enum class scilex::error_policy
strong

What a lexer does when it reaches a byte that no rule in the active mode can begin.

The default preserves the historical behaviour exactly; token is opt-in recovery.

Enumerator
raise 

Throw a lex_error at the first unmatched byte (the default).

token 

Recover: emit the maximal unmatched byte run as one scilex::error token and resume. An error run costs whatever the grammar pays to find no match at each position. A first-byte pre-filter skips positions where no rule can begin (usually O(1) per byte), so recovery on a long run becomes expensive only when an unanchored, greedy rule scans far ahead before failing. Prefer rules with a definite leading byte.

Definition at line 200 of file lexer.hpp.
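
The difference between the two policies can be sketched with a toy scanner that is independent of SciLex. Everything here (scan, tok, can_begin, the word and number kinds) is hypothetical; only the error kind's value follows scilex::error. Every toy rule has a definite leading byte, so the first-byte filter is exact, which a real grammar does not guarantee.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

enum class error_policy { raise, token };  // mirrors the enumerators above

constexpr int error_kind = std::numeric_limits<int>::min() + 4;  // value of scilex::error
constexpr int word = 1;    // hypothetical user kind: [a-z]+
constexpr int number = 2;  // hypothetical user kind: [0-9]+

struct tok {
    int kind;
    std::string lexeme;
};

inline bool is_lower(char c) { return c >= 'a' && c <= 'z'; }
inline bool is_digit(char c) { return c >= '0' && c <= '9'; }

// First-byte pre-filter: can any rule (word, number, skipped space) begin here?
inline bool can_begin(char c) { return is_lower(c) || is_digit(c) || c == ' '; }

inline std::vector<tok> scan(std::string_view src, error_policy policy) {
    std::vector<tok> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const std::size_t start = i;
        int kind = 0;
        if (src[i] == ' ') {  // skipped rule: produces no token
            ++i;
            continue;
        }
        if (is_lower(src[i])) {
            kind = word;
            while (i < src.size() && is_lower(src[i])) ++i;
        } else if (is_digit(src[i])) {
            kind = number;
            while (i < src.size() && is_digit(src[i])) ++i;
        } else if (policy == error_policy::raise) {
            throw std::runtime_error("no rule matches at byte " + std::to_string(start));
        } else {
            // Recovery: swallow the maximal run of bytes no rule can begin.
            kind = error_kind;
            while (i < src.size() && !can_begin(src[i])) ++i;
        }
        out.push_back(tok{kind, std::string(src.substr(start, i - start))});
    }
    return out;
}
```

Scanning "ab ##$ 12" under error_policy::token yields a word, one error token whose lexeme is "##$", and a number. Under error_policy::raise the same input throws at byte 3.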

Function Documentation

◆ apply_transition()

void scilex::apply_transition ( const rule &  r,
position  start,
std::vector< frame > &  stack 
)
inline

Applies rule r's mode transition (if any) to stack — the per-scan mode-stack mutation, kept pure so the lexer and the fuzz oracle share it verbatim.

Depends only on r's action (its pre-resolved mode_action::target_id), the token start start, and the stack; it mutates only stack. push enters the target (remembering start), pop leaves the current mode, set replaces it in place. The target id is resolved once at build time, so this hot per-token pivot does no name→id map lookup.

Exceptions
lex_error  On a pop while the stack is at its root (nothing to leave).

Definition at line 174 of file lexer.hpp.
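
The push / pop / set semantics described above can be sketched as a standalone function. The types below are simplified, hypothetical stand-ins for scilex::frame, scilex::position, and scilex::mode_action, and their field names are invented for illustration. The sketch also assumes that set records the new start position and that a plain std::runtime_error stands in for lex_error.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified stand-ins; field names are illustrative only.
enum class op { none, push, pop, set };

struct position {
    std::size_t line = 1;
    std::size_t column = 1;
};

struct frame {
    int mode;          // active mode id
    position entered;  // where the mode was entered
};

struct mode_action {
    op kind = op::none;
    int target_id = 0;  // resolved ahead of time, so no name lookup here
};

// Applies `a` to `stack`; reads only its arguments and mutates only `stack`.
inline void apply(const mode_action& a, position start, std::vector<frame>& stack) {
    switch (a.kind) {
    case op::none:
        break;
    case op::push:  // enter the target mode, remembering where
        stack.push_back(frame{a.target_id, start});
        break;
    case op::pop:  // leave the current mode; the root cannot be left
        if (stack.size() <= 1) throw std::runtime_error("pop at root mode");
        stack.pop_back();
        break;
    case op::set:  // replace the current mode in place (assumption: new start is recorded)
        stack.back() = frame{a.target_id, start};
        break;
    }
}
```

Because the function touches nothing but its arguments, a test oracle can replay the same sequence of actions on its own stack and compare the results.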

◆ layout()

std::vector< token > scilex::layout ( std::span< const token >  tokens,
const std::vector< bool > &  mode_significant = {} 
)
inline

Rewrites tokens with NEWLINE / INDENT / DEDENT inserted.

Parameters
[in]  tokens  An end-of-input-terminated token sequence.
[in]  mode_significant  Per-mode-id significance policy (Layout Awareness Level A): index by a token's mode_id; true (or a mode-id beyond the vector) means the token shapes layout, false means it is passed through without affecting indentation. An empty vector (the default) means every token is significant, which is byte-for-byte the positional pass. (A std::vector<bool> rather than a std::span<const bool>: the bit-packed vector<bool> cannot be viewed as a contiguous span of bool.)
Returns
The layout-aware token sequence (still end-of-input-terminated).
Exceptions
layout_error  If a line dedents to an indentation that no open block used.

Definition at line 102 of file layout.hpp.
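
The indentation-stack idea behind NEWLINE / INDENT / DEDENT insertion can be sketched without SciLex types. This is a generic, Python-style sketch and not the library's implementation: layout_sketch and line_kind are hypothetical, each logical line is reduced to its indentation width, open blocks are assumed to close at end of input, and std::runtime_error stands in for layout_error.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Values mirror scilex::newline / indent / dedent.
constexpr int newline_kind = std::numeric_limits<int>::min() + 1;
constexpr int indent_kind = std::numeric_limits<int>::min() + 2;
constexpr int dedent_kind = std::numeric_limits<int>::min() + 3;
constexpr int line_kind = 1;  // hypothetical placeholder for a line's own tokens

// Takes the indentation width of each logical line and returns the kinds a
// layout pass would produce around them.
inline std::vector<int> layout_sketch(const std::vector<std::size_t>& widths) {
    std::vector<int> out;
    std::vector<std::size_t> levels{0};  // indentation of every open block
    for (const std::size_t w : widths) {
        if (w > levels.back()) {
            levels.push_back(w);  // deeper than the current block: open one
            out.push_back(indent_kind);
        } else {
            while (w < levels.back()) {  // shallower: close blocks until it fits
                levels.pop_back();
                out.push_back(dedent_kind);
            }
            if (w != levels.back())
                throw std::runtime_error("indentation matches no enclosing level");
        }
        out.push_back(line_kind);
        out.push_back(newline_kind);
    }
    while (levels.size() > 1) {  // assumption: close open blocks at end of input
        levels.pop_back();
        out.push_back(dedent_kind);
    }
    return out;
}
```

For widths {0, 4, 4, 0} the sketch emits one indent before the second line and one dedent before the fourth. Widths {0, 4, 2} throw, because 2 matches neither open level.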

Variable Documentation

◆ dedent

constexpr int scilex::dedent {std::numeric_limits<int>::min() + 3}
inline constexpr

Reserved kind: indentation decreased (end of a block).

Definition at line 54 of file layout.hpp.

◆ end_of_input

constexpr int scilex::end_of_input {std::numeric_limits<int>::min()}
inline constexpr

Reserved token kind for the synthetic end-of-input token.

Emitted at the end of the input when tokenizing with scilex::eof_policy::append (the parser-friendly mode: there is always a current token, including a terminal one to match). SciLex reserves this value; user-defined token kinds must not use it.

Definition at line 26 of file token.hpp.

◆ error

constexpr int scilex::error {std::numeric_limits<int>::min() + 4}
inline constexpr

Reserved token kind for a lexical-error run under scilex::error_policy::token.

When error recovery is enabled, a maximal run of bytes that no rule in the active mode can begin is emitted as one token of this kind (its token::lexeme is the exact offending bytes), instead of throwing. Part of the reserved family: end_of_input is min(), the layout kinds (scilex::newline / indent / dedent in layout.hpp) are min()+1..+3, so this takes the next free slot, min()+4. User-defined token kinds must not use it.

Definition at line 37 of file token.hpp.
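
One way a client might guard against colliding with the reserved family is a compile-time check. This is a sketch: is_reserved and my_kind are hypothetical names, and the range min() through min()+4 is taken from the descriptions above.

```cpp
#include <limits>

// The reserved family occupies the five lowest int values:
// end_of_input, newline, indent, dedent, error.
constexpr int reserved_first = std::numeric_limits<int>::min();
constexpr int reserved_last = reserved_first + 4;

// True if `kind` falls in the reserved range. Nothing is below reserved_first,
// so one comparison is enough.
constexpr bool is_reserved(int kind) { return kind <= reserved_last; }

// Hypothetical user-defined token kinds.
enum class my_kind : int { identifier = 0, number, text };

static_assert(!is_reserved(static_cast<int>(my_kind::identifier)));
static_assert(!is_reserved(static_cast<int>(my_kind::number)));
static_assert(!is_reserved(static_cast<int>(my_kind::text)));
static_assert(is_reserved(reserved_first) && is_reserved(reserved_last));
```

Kinds numbered upward from zero, as in this enum, stay clear of the reserved range.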

◆ indent

constexpr int scilex::indent {std::numeric_limits<int>::min() + 2}
inline constexpr

Reserved kind: indentation increased (start of a deeper block).

Definition at line 52 of file layout.hpp.

◆ newline

constexpr int scilex::newline {std::numeric_limits<int>::min() + 1}
inline constexpr

Reserved kind: end of a logical line.

Definition at line 50 of file layout.hpp.