The SciLex public API (scilex::lexer, scilex::rule, scilex::token).

| Kind | Name | Description |
|---|---|---|
| struct | frame | One entry on the per-scan mode stack: the active mode and where it was entered (the entry position feeds the unterminated/diagnostic messages). |
| class | layout_error | Thrown when a line's indentation matches no enclosing level. |
| class | lex_error | Thrown when no rule matches at a position (a lexical error). |
| class | lexer | A lexer built from an ordered list of rules. |
| struct | mode_action | A mode transition, fired when its rule wins, acting on the scan's mode stack: enter a nested mode, leave the current one, or replace it. |
| struct | position | A location in the source text. |
| struct | rule | A token rule: a kind, the pattern that recognizes it, whether matches are discarded (whitespace, comments), and, for contextual lexing, the modes it is active in and an optional mode transition it fires when it wins. |
| struct | token | One lexical token: a typed slice of the source. |
| class | token_iterator | Forward (single-pass) iterator yielding one token at a time. |
| class | token_range | A lazy range of tokens, returned by lexer::scan. |

| Return type | Function | Description |
|---|---|---|
| std::vector< token > | layout (std::span< const token > tokens, const std::vector< bool > &mode_significant={}) | Rewrites tokens with NEWLINE / INDENT / DEDENT inserted. |
| void | apply_transition (const rule &r, position start, std::vector< frame > &stack) | Applies rule r's mode transition (if any) to stack: the per-scan mode-stack mutation, kept pure so the lexer and the fuzz oracle share it verbatim. |

| Type | Variable | Description |
|---|---|---|
| constexpr int | newline {std::numeric_limits<int>::min() + 1} | Reserved kind: end of a logical line. |
| constexpr int | indent {std::numeric_limits<int>::min() + 2} | Reserved kind: indentation increased (start of a deeper block). |
| constexpr int | dedent {std::numeric_limits<int>::min() + 3} | Reserved kind: indentation decreased (end of a block). |
| constexpr int | end_of_input {std::numeric_limits<int>::min()} | Reserved token kind for the synthetic end-of-input token. |
| constexpr int | error {std::numeric_limits<int>::min() + 4} | Reserved token kind for a lexical-error run under scilex::error_policy::token. |

◆ column_unit
The unit a token's position::column is counted in.
The default, bytes, preserves the historical behaviour bit-for-bit. codepoints counts Unicode scalar values (each valid UTF-8 codepoint is one column), and utf16 counts UTF-16 code units (a BMP codepoint is 1, an astral codepoint 2), which is the unit an LSP client expects. A malformed byte (an orphan continuation, an overlong or out-of-range sequence) counts as one unit in every mode, so the column stays defined on the error runs that error_policy::token emits. The chosen unit is not carried on position (one field, not self-describing); instead the lexer declares it via lexer::columns, a named trade-off rather than a silent default.
| Enumerator | Description |
|---|---|
| bytes | One column per byte (the default; column == byte offset within the line + 1). |
| codepoints | One column per Unicode scalar value (a valid UTF-8 codepoint). |
| utf16 | One column per UTF-16 code unit (BMP = 1, astral = 2); the LSP unit. |
Definition at line 221 of file lexer.hpp.
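The counting rules above can be sketched as a standalone function. This is an illustration only, not SciLex's implementation: the names unit, decode and column_of are hypothetical, and the sketch assumes the offset falls on a sequence boundary.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical mirror of scilex::column_unit, for illustration only.
enum class unit { bytes, codepoints, utf16 };

// Decodes the UTF-8 sequence starting at s[i] into cp.
// Returns its length in bytes, or 0 if the sequence is malformed
// (orphan continuation, truncated, overlong, surrogate, or out of range).
inline std::size_t decode(std::string_view s, std::size_t i, char32_t &cp) {
    const auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char lead = byte(i);
    if (lead < 0x80) { cp = lead; return 1; }
    std::size_t n = 0;
    char32_t lowest = 0;  // smallest codepoint that legitimately needs n bytes
    if ((lead & 0xE0) == 0xC0)      { n = 2; cp = lead & 0x1F; lowest = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { n = 3; cp = lead & 0x0F; lowest = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { n = 4; cp = lead & 0x07; lowest = 0x10000; }
    else return 0;
    if (i + n > s.size()) return 0;
    for (std::size_t k = 1; k < n; ++k) {
        if ((byte(i + k) & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (byte(i + k) & 0x3F);
    }
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return n;
}

// 1-based column of byte `offset` within `line`, counted in unit `u`.
inline std::size_t column_of(std::string_view line, std::size_t offset, unit u) {
    if (u == unit::bytes) return offset + 1;
    std::size_t col = 1;
    std::size_t i = 0;
    while (i < offset && i < line.size()) {
        char32_t cp = 0;
        const std::size_t n = decode(line, i, cp);
        if (n == 0) { ++col; ++i; continue; }  // malformed byte: one unit in every mode
        col += (u == unit::utf16 && cp >= 0x10000) ? 2 : 1;
        i += n;
    }
    return col;
}
```

For a line holding "a", "é" (2 bytes), an astral emoji (4 bytes) and "z", the "z" at byte offset 7 lands in column 8 under bytes, 4 under codepoints and 5 under utf16.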
◆ eof_policy
Whether tokenization appends a synthetic end-of-input token.
eof_policy::append yields one final token of kind end_of_input at the end position once the input is exhausted. This is the parser-friendly mode: a cursor always has a current token to match against.
| Enumerator | Description |
|---|---|
| omit | Stop at the last real token (default). |
| append | Append one end_of_input token at the end position. |
Definition at line 53 of file lexer.hpp.
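Why a terminal token helps a parser can be shown with a small self-contained sketch. The types tok, eof, finish and cursor are hypothetical stand-ins, not the SciLex API; only the reserved value mirrors the documented end_of_input.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Value mirrors the documented scilex::end_of_input; the name is hypothetical.
constexpr int end_of_input_kind = std::numeric_limits<int>::min();

struct tok { int kind; std::size_t offset; };  // hypothetical minimal token
enum class eof { omit, append };               // hypothetical mirror of eof_policy

// Finishes a token list according to the policy: under `append`, one
// terminal token is added at the end position (`input_size`).
inline std::vector<tok> finish(std::vector<tok> toks, std::size_t input_size, eof p) {
    if (p == eof::append) toks.push_back({end_of_input_kind, input_size});
    return toks;
}

// A cursor over an appended stream always has a current token: it parks
// on the terminal token instead of running past the end.
// Precondition: `toks` is non-empty and ends with end_of_input_kind.
struct cursor {
    const std::vector<tok> &toks;
    std::size_t i = 0;
    const tok &current() const { return toks[i]; }
    void advance() { if (toks[i].kind != end_of_input_kind) ++i; }
};
```

Under omit the same cursor would need a bounds check before every read, which is the bookkeeping that append removes.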
◆ error_policy
What a lexer does when it reaches a byte that no rule in the active mode can begin.
The default preserves the historical behaviour exactly; token is opt-in recovery.
| Enumerator | Description |
|---|---|
| raise | Throw a lex_error at the first unmatched byte (the default). |
| token | Recover: emit the maximal unmatched byte run as one scilex::error token and resume. The cost of an error run is the grammar's no-match cost: a first-byte pre-filter skips positions no rule can begin (usually O(1) per byte), so recovery on a long run becomes expensive only when an unanchored, greedy rule scans far before failing. Prefer rules with a definite leading byte. |
Definition at line 200 of file lexer.hpp.
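The recovery behaviour described for token can be sketched as follows. This is a simplified model under stated assumptions, not the library's scanner: piece, matcher and scan_with_recovery are hypothetical names, a single matcher callback stands in for the whole rule set, and can_begin plays the role of the first-byte pre-filter.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical output unit: either a matched token or an error run.
struct piece { bool is_error; std::string text; };

// Hypothetical matcher: length of the match at the front of the input, 0 if none.
using matcher = std::function<std::size_t(std::string_view)>;

inline std::vector<piece> scan_with_recovery(std::string_view in,
                                             const std::array<bool, 256> &can_begin,
                                             const matcher &match) {
    std::vector<piece> out;
    const std::size_t none = std::string_view::npos;
    std::size_t err_start = none;  // start of the pending error run, if any
    // Emits the pending error run [err_start, end) as one piece.
    auto flush = [&](std::size_t end) {
        if (err_start != none) {
            out.push_back({true, std::string(in.substr(err_start, end - err_start))});
            err_start = none;
        }
    };
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b = static_cast<unsigned char>(in[i]);
        // Pre-filter: only run the matcher where some rule could begin.
        const std::size_t n = can_begin[b] ? match(in.substr(i)) : 0;
        if (n == 0) {              // unmatched byte: extend the error run
            if (err_start == none) err_start = i;
            ++i;
            continue;
        }
        flush(i);                  // a match ends the maximal error run
        out.push_back({false, std::string(in.substr(i, n))});
        i += n;
    }
    flush(in.size());
    return out;
}
```

With a digits-only matcher, the input "12@@#34" yields the token "12", one error run "@@#", and the token "34". Bytes rejected by can_begin cost a single table lookup each, which is why a definite leading byte keeps long error runs cheap.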
◆ apply_transition()
| inline void scilex::apply_transition (const rule &r, position start, std::vector< frame > &stack) |
Applies rule r's mode transition (if any) to stack. This is the per-scan mode-stack mutation, kept pure so the lexer and the fuzz oracle share it verbatim.
Depends only on r's action (its pre-resolved mode_action::target_id), the token start start, and the stack; it mutates only stack. push enters the target (remembering start), pop leaves the current mode, and set replaces it in place. The target id is resolved once at build time, so this hot per-token step performs no name-to-id map lookup.
- Exceptions
  - lex_error: On a pop while the stack is at its root (nothing to leave).
Definition at line 174 of file lexer.hpp.
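The three transitions can be modelled with a few lines of standalone code. The types below (op, pos, action, mode_frame) and the function apply_action are hypothetical stand-ins for mode_action, position, frame and apply_transition; std::runtime_error stands in for lex_error. Whether set also updates the recorded entry position is not stated above, so the sketch assumes it does not.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

enum class op { none, push, pop, set };               // hypothetical action kinds
struct pos { std::size_t line; std::size_t column; }; // hypothetical position
struct action { op kind = op::none; int target_id = 0; };
struct mode_frame { int mode_id; pos entered; };      // active mode + entry position

// Mutates only `stack`; depends only on the action, the token start and the stack.
// Precondition: `stack` holds at least the root frame.
inline void apply_action(const action &a, pos start, std::vector<mode_frame> &stack) {
    switch (a.kind) {
    case op::none:
        break;
    case op::push:  // enter the target mode, remembering where it was entered
        stack.push_back({a.target_id, start});
        break;
    case op::pop:   // leave the current mode; the root cannot be left
        if (stack.size() <= 1) throw std::runtime_error("pop at root mode");
        stack.pop_back();
        break;
    case op::set:   // replace the current mode in place (entry position kept: an assumption)
        stack.back().mode_id = a.target_id;
        break;
    }
}
```

Because the function touches nothing but its arguments, a test oracle can replay the same sequence of actions against its own stack and compare the result with the lexer's.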
◆ layout()
| inline std::vector< token > scilex::layout (std::span< const token > tokens, const std::vector< bool > &mode_significant = {}) |
Rewrites tokens with NEWLINE / INDENT / DEDENT inserted.
- Parameters
  - [in] tokens: An end-of-input-terminated token sequence.
  - [in] mode_significant: Per-mode-id significance policy (Layout Awareness Level A), indexed by a token's mode_id. true (or a mode-id beyond the end of the vector) means the token shapes layout; false means it is passed through without affecting indentation. An empty vector (the default) means every token is significant, which is byte-for-byte the positional pass. (This is a std::vector<bool> rather than a std::span<const bool> because the bit-packed vector<bool> cannot be viewed as a contiguous span of bool.)
- Returns
  - The layout-aware token sequence (still end-of-input-terminated).
- Exceptions
  - layout_error: If a line dedents to an indentation that no open block used.
Definition at line 102 of file layout.hpp.
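The general shape of such a pass is the familiar indentation-stack algorithm, sketched below. This is a generic illustration, not scilex::layout itself: the token type, the function layout_sketch, the exact ordering of NEWLINE relative to INDENT / DEDENT, and the positions given to inserted tokens are all assumptions, std::runtime_error stands in for layout_error, and mode_significant is omitted (every token is treated as significant).

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Values mirror the documented reserved kinds; the names are hypothetical.
constexpr int k_eoi     = std::numeric_limits<int>::min();
constexpr int k_newline = k_eoi + 1;
constexpr int k_indent  = k_eoi + 2;
constexpr int k_dedent  = k_eoi + 3;

struct tok { int kind; std::size_t line; std::size_t column; };  // 1-based positions

inline std::vector<tok> layout_sketch(const std::vector<tok> &in) {
    std::vector<tok> out;
    std::vector<std::size_t> levels{1};  // open indentation columns; column 1 is the root
    std::size_t prev_line = 0;           // 0 means no token seen yet
    for (const tok &t : in) {
        if (t.kind == k_eoi) {
            // Close the last line and every open block, then keep the terminator.
            if (prev_line != 0) out.push_back({k_newline, t.line, t.column});
            for (; levels.size() > 1; levels.pop_back())
                out.push_back({k_dedent, t.line, t.column});
            out.push_back(t);
            break;
        }
        if (t.line != prev_line) {       // first token on a new line
            if (prev_line != 0) out.push_back({k_newline, t.line, t.column});
            if (t.column > levels.back()) {
                levels.push_back(t.column);
                out.push_back({k_indent, t.line, t.column});
            } else {
                while (levels.size() > 1 && t.column < levels.back()) {
                    levels.pop_back();
                    out.push_back({k_dedent, t.line, t.column});
                }
                if (t.column != levels.back())
                    throw std::runtime_error("dedent matches no open block");
            }
            prev_line = t.line;
        }
        out.push_back(t);
    }
    return out;
}
```

Three one-token lines at columns 1, 5 and 1 produce: token, NEWLINE, INDENT, token, NEWLINE, DEDENT, token, NEWLINE, end-of-input. A third line at column 3 instead would throw, since no open block used that indentation.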
◆ dedent
| inline constexpr int scilex::dedent {std::numeric_limits<int>::min() + 3} |
Reserved kind: indentation decreased (end of a block).
Definition at line 54 of file layout.hpp.
◆ end_of_input
| inline constexpr int scilex::end_of_input {std::numeric_limits<int>::min()} |
Reserved token kind for the synthetic end-of-input token.
Emitted at the end of the input when tokenizing with scilex::eof_policy::append (the parser-friendly mode: there is always a current token, including a terminal one to match). SciLex reserves this value; user-defined token kinds must not use it.
Definition at line 26 of file token.hpp.
◆ error
| inline constexpr int scilex::error {std::numeric_limits<int>::min() + 4} |
Reserved token kind for a lexical-error run under scilex::error_policy::token.
When error recovery is enabled, a maximal run of bytes that no rule in the active mode can begin is emitted as one token of this kind (its token::lexeme is the exact offending bytes), instead of throwing. Part of the reserved family: end_of_input is min(), the layout kinds (scilex::newline / indent / dedent in layout.hpp) are min()+1..+3, so this takes the next free slot, min()+4. User-defined token kinds must not use it.
Definition at line 37 of file token.hpp.
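The reserved family occupies the five lowest int values, so a user-side guard against collisions is a single comparison. The sketch below redeclares the constants in a hypothetical namespace sketch so that it compiles on its own; is_reserved is an illustrative helper, not part of SciLex.

```cpp
#include <limits>

namespace sketch {
// Values mirror the documented reserved kinds.
constexpr int end_of_input = std::numeric_limits<int>::min();
constexpr int newline      = end_of_input + 1;
constexpr int indent       = end_of_input + 2;
constexpr int dedent       = end_of_input + 3;
constexpr int error        = end_of_input + 4;

// True for any kind in the reserved block min() .. min()+4.
constexpr bool is_reserved(int kind) { return kind <= error; }
}  // namespace sketch

// A hypothetical user-defined kind set, checked at compile time.
enum my_kind : int { identifier = 0, number = 1, string_lit = 2 };
static_assert(!sketch::is_reserved(identifier), "user kinds must avoid the reserved block");
static_assert(!sketch::is_reserved(string_lit), "user kinds must avoid the reserved block");
static_assert(sketch::is_reserved(sketch::dedent), "layout kinds are reserved");
```

Numbering user kinds upward from zero keeps them clear of the reserved block without any further coordination.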
◆ indent
| inline constexpr int scilex::indent {std::numeric_limits<int>::min() + 2} |
Reserved kind: indentation increased (start of a deeper block).
Definition at line 52 of file layout.hpp.
◆ newline
| inline constexpr int scilex::newline {std::numeric_limits<int>::min() + 1} |
Reserved kind: end of a logical line.
Definition at line 50 of file layout.hpp.