The SciLex public API (scilex::lexer, scilex::rule, scilex::token).

Classes
struct	frame
	One entry on the per-scan mode stack: the active mode and where it was entered (the entry position feeds unterminated-construct and other diagnostic messages).

class	layout_error
	Thrown when a line's indentation matches no enclosing level.

class	lex_error
	Thrown when no rule matches at a position (a lexical error).

class	lexer
	A lexer built from an ordered list of rules.

struct	mode_action
	A mode transition, fired when its rule wins, acting on the scan's mode stack: enter a nested mode, leave the current one, or replace it.

struct	position
	A location in the source text.

struct	rule
	A token rule: a kind, the pattern that recognizes it, whether matches are discarded (whitespace, comments), and, for contextual lexing, the modes it is active in and an optional mode transition it fires when it wins.

struct	token
	One lexical token: a typed slice of the source.

class	token_iterator
	Forward (single-pass) iterator yielding one token at a time.

class	token_range
	A lazy range of tokens, returned by lexer::scan.

Enumerations
enum class	eof_policy { omit , append }
	Whether tokenization appends a synthetic end-of-input token.

enum class	error_policy { raise , token }
	What a lexer does when it reaches a byte that no rule in the active mode can begin.

enum class	column_unit { bytes , codepoints , utf16 }
	The unit a token's `position::column` is counted in.

Functions
std::vector< token >	layout (std::span< const token > tokens, const std::vector< bool > &mode_significant={})
	Rewrites `tokens` with NEWLINE / INDENT / DEDENT inserted.

void	apply_transition (const rule &r, position start, std::vector< frame > &stack)
	Applies the mode transition of rule `r` (if any) to `stack`. This is the per-scan mode-stack mutation, kept free of hidden state so the lexer and the fuzz oracle share it verbatim.

Variables
constexpr int	newline {std::numeric_limits<int>::min() + 1}
	Reserved kind: end of a logical line.

constexpr int	indent {std::numeric_limits<int>::min() + 2}
	Reserved kind: indentation increased (start of a deeper block).

constexpr int	dedent {std::numeric_limits<int>::min() + 3}
	Reserved kind: indentation decreased (end of a block).

constexpr int	end_of_input {std::numeric_limits<int>::min()}
	Reserved token kind for the synthetic end-of-input token.

constexpr int	error {std::numeric_limits<int>::min() + 4}
	Reserved token kind for a lexical-error run under scilex::error_policy::token.

Detailed Description

The SciLex public API (scilex::lexer, scilex::rule, scilex::token).

Enumeration Type Documentation

◆ column_unit

enum class scilex::column_unit

strong

The unit a token's position::column is counted in.

The default, bytes, preserves the historical behaviour bit-for-bit. codepoints counts Unicode scalar values (each valid UTF-8 codepoint is one column), and utf16 counts UTF-16 code units (a BMP codepoint is 1, an astral codepoint is 2), which is the unit an LSP client expects. A malformed byte (an orphan continuation, an overlong or out-of-range sequence) counts as one unit in every mode, so the column stays defined on the error runs that error_policy::token emits. The chosen unit is not carried on position, which has a single column field and is therefore not self-describing; the lexer declares the unit via lexer::columns, a named trade-off rather than a silent default.

Enumerator
bytes	One column per byte (the default; column == byte offset within the line + 1).
codepoints	One column per Unicode scalar value (a valid UTF-8 codepoint).
utf16	One column per UTF-16 code unit (BMP = 1, astral = 2) — the LSP unit.

Definition at line 221 of file lexer.hpp.
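
The three units can be compared with a small standalone sketch. It does not use SciLex itself: `column_of` and `utf8_length` are hypothetical helpers, the local `column_unit` only mirrors the documented enumerators, and treating every byte of a malformed sequence as its own one-unit column is an assumption about how "a malformed byte counts as one unit" plays out.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in mirroring the documented scilex::column_unit enumerators.
enum class column_unit { bytes, codepoints, utf16 };

// Length of the well-formed UTF-8 sequence starting at s[i], or 0 if the bytes
// there are malformed (orphan continuation, truncated, overlong, surrogate,
// or above U+10FFFF).
inline std::size_t utf8_length(std::string_view s, std::size_t i) {
    const auto b = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char lead = b(i);
    std::size_t len = 0;
    if (lead < 0x80) return 1;
    else if (lead >= 0xC2 && lead <= 0xDF) len = 2;
    else if (lead >= 0xE0 && lead <= 0xEF) len = 3;
    else if (lead >= 0xF0 && lead <= 0xF4) len = 4;
    else return 0;  // continuation byte, overlong C0/C1 lead, or lead > F4
    if (i + len > s.size()) return 0;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k)
        if ((b(i + k) & 0xC0) != 0x80) return 0;
    const unsigned char second = b(i + 1);
    if (lead == 0xE0 && second < 0xA0) return 0;  // overlong 3-byte form
    if (lead == 0xED && second > 0x9F) return 0;  // UTF-16 surrogate range
    if (lead == 0xF0 && second < 0x90) return 0;  // overlong 4-byte form
    if (lead == 0xF4 && second > 0x8F) return 0;  // above U+10FFFF
    return len;
}

// 1-based column of byte offset `offset` within `line`, counted in `unit`.
inline std::size_t column_of(std::string_view line, std::size_t offset, column_unit unit) {
    assert(offset <= line.size());
    if (unit == column_unit::bytes) return offset + 1;
    std::size_t units = 0;
    for (std::size_t i = 0; i < offset;) {
        const std::size_t len = utf8_length(line, i);
        if (len == 0) { ++units; ++i; continue; }  // malformed byte: one unit (assumption)
        units += (unit == column_unit::utf16 && len == 4) ? 2 : 1;  // astral = 2 UTF-16 units
        i += len;
    }
    return units + 1;
}
```

For the line `a`, `é`, `😀`, `z` encoded as UTF-8 (8 bytes), the `z` at byte offset 7 lands in column 8 under bytes, 4 under codepoints, and 5 under utf16 in this sketch.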

◆ eof_policy

enum class scilex::eof_policy

strong

Whether tokenization appends a synthetic end-of-input token.

eof_policy::append yields one final token of kind end_of_input at the end position once the input is exhausted — the parser-friendly mode, so a cursor always has a current token to match against.

Enumerator
omit	Stop at the last real token (default).
append	Append one end_of_input token at the end position.

Definition at line 53 of file lexer.hpp.
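
Why an appended end-of-input token is parser-friendly can be illustrated with a minimal cursor. This is a sketch under stated assumptions, not SciLex code: `tok` and `cursor` are hypothetical, and only the `end_of_input` value is taken from this page.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Value as documented for scilex::end_of_input.
constexpr int end_of_input = std::numeric_limits<int>::min();

struct tok { int kind; };  // hypothetical, pared-down token

// Hypothetical parser cursor over an end-of-input-terminated sequence.
// Because the last token is always end_of_input, peek() never runs off the end.
class cursor {
public:
    explicit cursor(const std::vector<tok>& toks) : toks_(toks) {
        assert(!toks_.empty() && toks_.back().kind == end_of_input);
    }
    const tok& peek() const { return toks_[pos_]; }
    // Consumes the current token if it has the given kind; the terminal
    // end_of_input token is matched but never stepped past.
    bool accept(int kind) {
        if (peek().kind != kind) return false;
        if (kind != end_of_input) ++pos_;
        return true;
    }
private:
    const std::vector<tok>& toks_;
    std::size_t pos_ = 0;
};
```

With eof_policy::omit, the same cursor would need a separate "at end" check before every `peek()`.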

◆ error_policy

enum class scilex::error_policy

strong

What a lexer does when it reaches a byte that no rule in the active mode can begin.

The default preserves the historical behaviour exactly; token is opt-in recovery.

Enumerator
raise	Throw a lex_error at the first unmatched byte (the default).
token	Recover: emit the maximal unmatched byte run as one scilex::error token and resume. An error run costs whatever the grammar costs to report no match at each byte. A first-byte pre-filter skips positions where no rule can begin (usually O(1) per byte), so recovery on a long run is expensive only when an unanchored, greedy rule scans far ahead before failing; prefer rules with a definite leading byte.

Definition at line 200 of file lexer.hpp.
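
The recovery scan described for error_policy::token can be sketched as follows. Everything here is hypothetical scaffolding rather than the SciLex implementation: `matcher` stands in for "try every rule active in the current mode", and `can_begin` is an assumed 256-entry first-byte table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string_view>

// Hypothetical matcher: length of the longest rule match at `pos`, 0 if none.
using matcher = std::function<std::size_t(std::string_view, std::size_t)>;

// End (exclusive) of the maximal unmatched run that starts at `pos`.
// The bytes in [pos, end) would become the lexeme of one error token.
inline std::size_t error_run_end(std::string_view in, std::size_t pos,
                                 const std::array<bool, 256>& can_begin,
                                 const matcher& match) {
    assert(pos <= in.size());
    std::size_t end = pos;
    while (end < in.size()) {
        const auto byte = static_cast<unsigned char>(in[end]);
        // Pre-filter first: the (possibly expensive) matcher only runs when
        // some rule could begin with this byte.
        if (can_begin[byte] && match(in, end) > 0) break;  // a rule matches: run ends
        ++end;  // pre-filter rejected the byte, or every candidate rule failed
    }
    return end;
}
```

In this sketch the cost per skipped byte is one table lookup unless the pre-filter lets the byte through, which is why a rule that can begin with almost any byte and fails late dominates the cost of a long run.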

Function Documentation

◆ apply_transition()

void scilex::apply_transition	(	const rule &	r,
		position	start,
		std::vector< frame > &	stack
	)

inline

Applies the mode transition of rule r (if any) to stack. This is the per-scan mode-stack mutation, kept free of hidden state so the lexer and the fuzz oracle share it verbatim.

Depends only on r's action (its pre-resolved mode_action::target_id), the token start start, and the stack; it mutates only stack. push enters the target (remembering start), pop leaves the current mode, set replaces it in place. The target id is resolved once at build time, so this hot per-token pivot does no name→id map lookup.

Exceptions

lex_error On a pop while the stack is at its root (nothing to leave).

Definition at line 174 of file lexer.hpp.
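
The push / pop / set semantics above can be sketched with pared-down stand-in types. The names `op`, `apply_action`, and the field layouts are hypothetical; the real function takes a `rule` and throws `lex_error`, for which `std::runtime_error` is substituted here. Whether `set` also refreshes the recorded entry position is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical, pared-down stand-ins; the real scilex types carry more fields.
struct position { std::size_t offset = 0; };
struct frame { int mode = 0; position entered{}; };
enum class op { none, push, pop, set };
struct mode_action { op kind = op::none; int target_id = 0; };

// Applies `a` to `stack`; reads only its arguments and mutates only `stack`.
inline void apply_action(const mode_action& a, position start, std::vector<frame>& stack) {
    assert(!stack.empty());  // the root frame is always present
    switch (a.kind) {
    case op::none:
        break;
    case op::push:  // enter the target mode, remembering where it was entered
        stack.push_back(frame{a.target_id, start});
        break;
    case op::pop:   // leave the current mode; the root cannot be left
        if (stack.size() == 1) throw std::runtime_error("pop at root mode");
        stack.pop_back();
        break;
    case op::set:   // replace the current mode in place (entry position refreshed: assumption)
        stack.back() = frame{a.target_id, start};
        break;
    }
}
```

Because `target_id` is already an integer id, the function does no name lookup, which matches the build-time resolution the description mentions.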

◆ layout()

std::vector< token > scilex::layout	(	std::span< const token >	tokens,
		const std::vector< bool > &	mode_significant = `{}`
	)

inline

Rewrites tokens with NEWLINE / INDENT / DEDENT inserted.

Parameters

[in]	tokens	An end-of-input-terminated token sequence.
[in]	mode_significant	Per-mode-id significance policy (Layout Awareness Level A): index by a token's `mode_id`; `true` (or a mode-id beyond the vector) means the token shapes layout, `false` means it is passed through without affecting indentation. An empty vector (the default) means every token is significant — byte-for-byte the positional pass. (A `std::vector<bool>` rather than a `std::span<const bool>`: the bit-packed `vector<bool>` cannot be viewed as a contiguous span of `bool`.)

Returns: The layout-aware token sequence (still end-of-input-terminated).

Exceptions

layout_error If a line dedents to an indentation that no open block used.

Definition at line 102 of file layout.hpp.
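
The indentation bookkeeping behind NEWLINE / INDENT / DEDENT insertion can be sketched with the classic off-side-rule algorithm. This is not the SciLex implementation: `layout_sketch` is hypothetical, it works on one indentation width per logical line instead of real tokens, it ignores `mode_significant`, it closes open blocks at the end of input as an assumption, and it throws `std::runtime_error` where the real function throws `layout_error`.

```cpp
#include <cassert>
#include <limits>
#include <stdexcept>
#include <vector>

// Reserved kinds with the values documented on this page.
constexpr int newline = std::numeric_limits<int>::min() + 1;
constexpr int indent  = std::numeric_limits<int>::min() + 2;
constexpr int dedent  = std::numeric_limits<int>::min() + 3;
constexpr int body    = 0;  // hypothetical placeholder for a line's ordinary tokens

// Turns per-line indentation widths into a kind sequence with layout kinds inserted.
inline std::vector<int> layout_sketch(const std::vector<int>& line_indents) {
    std::vector<int> out;
    std::vector<int> levels{0};  // enclosing indentation levels, outermost first
    for (int width : line_indents) {
        assert(width >= 0);
        if (width > levels.back()) {          // deeper: open one block
            levels.push_back(width);
            out.push_back(indent);
        } else {
            while (width < levels.back()) {   // shallower: close blocks
                levels.pop_back();
                out.push_back(dedent);
            }
            if (width != levels.back())       // landed between two open levels
                throw std::runtime_error("dedent matches no enclosing level");
        }
        out.push_back(body);
        out.push_back(newline);
    }
    for (; levels.size() > 1; levels.pop_back())
        out.push_back(dedent);                // close blocks still open at end of input
    return out;
}
```

Widths `{0, 4, 4, 0}` yield body, newline, indent, body, newline, body, newline, dedent, body, newline; widths `{0, 4, 2}` throw, because 2 matches neither open level.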

Variable Documentation

◆ dedent

constexpr int scilex::dedent {std::numeric_limits<int>::min() + 3}

inline constexpr

Reserved kind: indentation decreased (end of a block).

Definition at line 54 of file layout.hpp.

◆ end_of_input

constexpr int scilex::end_of_input {std::numeric_limits<int>::min()}

inline constexpr

Reserved token kind for the synthetic end-of-input token.

Emitted at the end of the input when tokenizing with scilex::eof_policy::append (the parser-friendly mode: there is always a current token, including a terminal one to match). SciLex reserves this value; user-defined token kinds must not use it.

Definition at line 26 of file token.hpp.

◆ error

constexpr int scilex::error {std::numeric_limits<int>::min() + 4}

inline constexpr

Reserved token kind for a lexical-error run under scilex::error_policy::token.

When error recovery is enabled, a maximal run of bytes that no rule in the active mode can begin is emitted as one token of this kind (its token::lexeme is the exact offending bytes), instead of throwing. Part of the reserved family: end_of_input is min(), the layout kinds (scilex::newline / indent / dedent in layout.hpp) are min()+1..+3, so this takes the next free slot, min()+4. User-defined token kinds must not use it.

Definition at line 37 of file token.hpp.
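
Since the reserved family occupies the five lowest `int` values, a grammar can guard its own kinds with a single comparison. The sketch below copies the documented values into a local `reserved` namespace; `is_reserved_kind` and `checked_user_kind` are hypothetical helpers, not part of SciLex, and the guard would need revisiting if further reserved kinds were ever added.

```cpp
#include <cassert>
#include <limits>

namespace reserved {  // values copied from the documentation on this page
constexpr int end_of_input = std::numeric_limits<int>::min();
constexpr int newline      = std::numeric_limits<int>::min() + 1;
constexpr int indent       = std::numeric_limits<int>::min() + 2;
constexpr int dedent       = std::numeric_limits<int>::min() + 3;
constexpr int error        = std::numeric_limits<int>::min() + 4;
}  // namespace reserved

// The family is contiguous from min() to min()+4, so one comparison suffices.
constexpr bool is_reserved_kind(int kind) { return kind <= reserved::error; }

static_assert(is_reserved_kind(reserved::end_of_input), "min() is reserved");
static_assert(is_reserved_kind(reserved::error), "min()+4 is reserved");
static_assert(!is_reserved_kind(reserved::error + 1), "min()+5 is free for users");

// Hypothetical guard for user-defined kinds: asserts in debug builds if a
// grammar tries to reuse a reserved value, otherwise returns the kind unchanged.
inline int checked_user_kind(int kind) {
    assert(!is_reserved_kind(kind));
    return kind;
}
```

Small non-negative enumerators, the usual choice for user kinds, stay far away from the reserved block.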

◆ indent

constexpr int scilex::indent {std::numeric_limits<int>::min() + 2}

inline constexpr

Reserved kind: indentation increased (start of a deeper block).

Definition at line 52 of file layout.hpp.

◆ newline

constexpr int scilex::newline {std::numeric_limits<int>::min() + 1}

inline constexpr

Reserved kind: end of a logical line.

Definition at line 50 of file layout.hpp.
