SciLex
A header-only C++20 lexer built on REAL
A small, header-only C++20 contextual lexer built on REAL.
Eager tokenize or lazy scan; positioned errors with a context snippet.

Define an ordered set of token rules, each a (kind, regex, skip) triple, and SciLex tokenizes by maximal munch: the longest anchored match wins, with rule order breaking ties. A rule can also opt into modes (contextual lexing), so the same byte lexes differently by context. Because it is a thin layer over REAL, tokenization is linear and ReDoS-safe by construction.
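The maximal-munch rule described above can be sketched in a few lines. This is an illustration only, not SciLex's implementation: it uses Python's standard `re` module as a stand-in for REAL (so it does not inherit the linear-time guarantee), and the rule list and the `tokenize` helper are hypothetical names made up for the example.

```python
import re

# Hypothetical rule list: (kind, pattern, skip) triples in priority order.
RULES = [
    ("KEYWORD", r"if|else", False),
    ("IDENT", r"[A-Za-z_]\w*", False),
    ("NUMBER", r"\d+", False),
    ("OP", r"==|=", False),
    ("WS", r"\s+", True),
]
COMPILED = [(kind, re.compile(pat), skip) for kind, pat, skip in RULES]


def tokenize(text):
    """Maximal munch: at each position the longest anchored match wins;
    when two rules match the same length, the rule listed first wins."""
    pos, out = 0, []
    while pos < len(text):
        best_kind, best_end, best_skip = None, pos, False
        for kind, rx, skip in COMPILED:
            m = rx.match(text, pos)  # anchored at pos
            # Strict '>' keeps the earlier rule on a tie.
            if m and m.end() > best_end:
                best_kind, best_end, best_skip = kind, m.end(), skip
        if best_kind is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if not best_skip:
            out.append((best_kind, text[pos:best_end]))
        pos = best_end
    return out
```

With these rules, `if` lexes as KEYWORD (a tie with IDENT, resolved by order) while `iffy` lexes as IDENT (the longer match wins).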
What that covers today: significant indentation, plus contexts like f-strings, YAML flow collections, and bracket continuation (modes + Layout Awareness Level A). Cases that need a deeper lexing↔indentation coupling — YAML block scalars | / >, heredocs — are Level B: documented, not in this version.
This follows the same design principles as REAL: purity, simplicity, and measured optimality.
- Rules: (kind, real::regex, skip)
- Modes: in_mode + a push / pop / set mode stack
- dfa_modes accelerates DFA-able modes ~20× with one real::dfa pass; best-effort (Pike is the floor, with fallback), identical token stream
- Eager (tokenize) and lazy (scan) APIs
- END_OF_INPUT token

The three modal grammars differ in shape and each documents its own scope; modes resolve the contexts above, but the one contextual case still outside the model, lexing steered by indentation (block scalars, heredocs), is Level B.
Not yet: block scalars / heredocs (Layout Awareness Level B), a compile-time static_lexer (a baked DFA — the Phase-0 spike found this wants build-time codegen, not constexpr), codepoint columns.
See the guided tour for details.
See docs/design.dox for the complete C++ API (lexer, token, position, layout, lex_error).
An abi3 CPython extension (CPython 3.10+, Limited API).
For significant indentation:
pip install scilex (wheels + sdist). Use scilex.get_include() to compile C++ code against the installed headers.
Build locally: make python && make python-test.
A flat rule list can't separate contexts where the same byte means different things — { opens a Python f-string interpolation but a dict elsewhere; < opens an XML tag in content but is just a character inside CDATA. SciLex handles this with an opt-in mode stack: a rule may be restricted to named modes (in_mode) and may push / pop / set the mode when it wins. The engine is unchanged — maximal munch and the exact first-byte dispatch simply run per mode.
This unlocks, with no engine change:
- Python f-strings such as f"sum={a+b}": code ↔ string body ↔ interpolation, nesting through the stack;
- XML: content ↔ tag (a shallow two-mode flip; CDATA and comments are single regex tokens, so an inner < is literal);
- YAML: block ↔ flow (significant indentation plus flow collections).

An action is None | ("push", mode) | ("set", mode) | ("pop",); a plain (kind, pattern, skip) rule needs neither field, so existing grammars are unaffected. See examples/python.hpp, examples/xml.hpp, examples/yaml.hpp for the three modal profiles in full.
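To make the mode stack concrete, here is a minimal sketch of the f-string case. It reuses the action shape quoted above (`None`, `("push", mode)`, `("set", mode)`, `("pop",)`), but everything else is an assumption for illustration: the four-field rule tuple, the mode names, and the `tokenize` helper are hypothetical, and Python's `re` stands in for REAL.

```python
import re

# Hypothetical rule shape: (kind, pattern, modes the rule is active in, action).
RULES = [
    ("FSTRING_START", r'f"',           ("code", "interp"), ("push", "fstring")),
    ("IDENT",         r"[A-Za-z_]\w*", ("code", "interp"), None),
    ("OP",            r"[+=]",         ("code", "interp"), None),
    ("INTERP_END",    r"\}",           ("interp",),        ("pop",)),
    ("TEXT",          r'[^{"]+',       ("fstring",),       None),
    ("INTERP_START",  r"\{",           ("fstring",),       ("push", "interp")),
    ("FSTRING_END",   r'"',            ("fstring",),       ("pop",)),
]
COMPILED = [(k, re.compile(p), modes, act) for k, p, modes, act in RULES]


def tokenize(text, start_mode="code"):
    """Maximal munch restricted to the rules of the mode on top of the stack."""
    stack, pos, out = [start_mode], 0, []
    while pos < len(text):
        mode = stack[-1]
        best = None  # (kind, end, action)
        for kind, rx, modes, action in COMPILED:
            if mode not in modes:
                continue
            m = rx.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (kind, m.end(), action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos} in mode {mode!r}")
        kind, end, action = best
        out.append((kind, text[pos:end], mode))  # token records its mode
        # Apply the winning rule's action to the mode stack.
        # (This sketch does not guard against popping the base mode.)
        if action is not None:
            if action[0] == "push":
                stack.append(action[1])
            elif action[0] == "set":
                stack[-1] = action[1]
            elif action[0] == "pop":
                stack.pop()
        pos = end
    return out
```

Running `tokenize('f"sum={a+b}"')` walks code → fstring → interp → fstring → code, so the `{` inside the string body opens an interpolation instead of being lexed as text.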
A mode can be accelerated by a real::dfa: instead of trying each candidate rule at every position, one DFA pass recognizes the winning rule — the same maximal munch, with the order tie-break baked into the automaton. On a mode where many rules share leading bytes that is **~20× the regular path** on the full token path.
It is best-effort and invisible: a mode whose rules need a zero-width assertion no DFA can represent, or whose DFA fails a build-time audit (a lazy quantifier — its match is the shortest span while a DFA takes the longest), silently stays on the regular Pike engine, absent from dfa_modes_active(). Either way the token stream is byte identical (Pike is the floor) and layout is unchanged. The DFA is built once, in the constructor. The sql and css example grammars ship with it on.
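The lazy-quantifier case is easy to see in isolation. The sketch below uses Python's `re` (not REAL) to contrast a lazy pattern with its greedy twin; `match_span` is a hypothetical helper. Both patterns accept the same set of strings, so an automaton that always takes the longest accepted span would behave like the greedy form and disagree with the lazy one.

```python
import re


def match_span(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = re.match(pattern, text)
    return m.group(0) if m else None


lazy = match_span(r"<.+?>", "<a><b>")   # stops at the first '>'
greedy = match_span(r"<.+>", "<a><b>")  # runs to the last '>'
```

Here `lazy` is `<a>` and `greedy` is `<a><b>`, which is the kind of mismatch the build-time audit described above is meant to catch.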
A real trade-off worth stating plainly. Write an identifier rule as \w+ (or [^\W\d]\w*) with the default flags and it reads Unicode identifiers — café, 変数 — the faithful behaviour for a language like Python 3. But a Unicode \w \d \s \b compiles to a match-time code-point predicate that no DFA can represent, so a mode holding one leaves the DFA fast path: it is transparently demoted to the general engine (same tokens, visible via dfa_modes_active()). Concretely the general engine runs at **~8–13 MB/s** while a DFA-able mode runs **~20× that** — the Unicode identifier costs you the DFA.
So: if your identifiers are ASCII by specification (JSON, SQL, C), pin **(?a)** inline in the pattern (or pass real::flags::ascii) to keep \w \d \s \b ASCII, small, and DFA-representable — what the examples/ grammars do. If you want Unicode identifiers, write \w+ and accept the general-engine floor. The two tokenize ASCII input identically; they differ only on non-ASCII input and on whether the mode can be a DFA. The **python-unicode** example (scilex --example python-unicode) is the faithful-Python-3 variant of python, identical but for that one rule.
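The difference between the two identifier rules can be checked with Python's own `re` module, which supports the same inline `(?a)` flag. This is only an analogy: it assumes REAL's `(?a)` restricts `\w` to ASCII in the same way, as the paragraph above describes, and `first_ident` is a hypothetical helper.

```python
import re

unicode_ident = re.compile(r"\w+")    # default flags: Unicode-aware \w
ascii_ident = re.compile(r"(?a)\w+")  # (?a): \w is [A-Za-z0-9_] only


def first_ident(rx, text):
    """Return the identifier matched at the start of `text`, or None."""
    m = rx.match(text)
    return m.group(0) if m else None


# Identical on ASCII input.
same = first_ident(unicode_ident, "total") == first_ident(ascii_ident, "total")

# Different on non-ASCII input: the ASCII rule stops before the accented letter.
uni = first_ident(unicode_ident, "caf\u00e9 = 1")  # 'café'
asc = first_ident(ascii_ident, "caf\u00e9 = 1")    # 'caf'
```

The ASCII form gives up the non-ASCII identifiers; in SciLex that is the price of keeping the mode DFA-representable.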
The layout pass is positional, and by default mode-blind. Layout Awareness Level A lets a mode be marked insignificant (Lexer(insignificant_modes=…)), so its tokens pass through without shaping indentation — and every token carries its mode (Token.mode) for the pass to read.
That lifts two real cases a decoupled positional pass otherwise gets wrong:
- multi-line flow: [\n 1,\n 2\n] adds no spurious INDENT/DEDENT;
- implicit continuation: a line break inside () [] {} reads as continuation, not a new block.

Two invariants hold: with no insignificant mode the result is byte-for-byte the positional pass (zero cost); and the mode is the single source of the policy (no per-rule flag).
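A minimal sketch of a mode-aware positional pass, under stated assumptions: tokens are plain `(kind, lexeme, line, col, mode)` tuples, columns are 1-based, and `layout` is a hypothetical function, not SciLex's API. It ignores inconsistent dedents, which a real pass would have to report.

```python
def layout(tokens, insignificant_modes=frozenset()):
    """Emit INDENT/DEDENT from the column of the first token on each line,
    skipping tokens whose mode is marked insignificant."""
    out, indents, last_line = [], [1], None
    for kind, lexeme, line, col, mode in tokens:
        first_on_line = line != last_line
        if first_on_line and mode not in insignificant_modes:
            if col > indents[-1]:
                indents.append(col)
                out.append("INDENT")
            while col < indents[-1]:
                indents.pop()
                out.append("DEDENT")
        last_line = line
        out.append(kind)
    # Close any blocks still open at end of input.
    out.extend("DEDENT" for _ in indents[1:])
    return out


# Tokens for:  x = [\n    1,\n    2\n]\ny   with the bracket body in a "flow" mode.
TOKENS = [
    ("IDENT", "x", 1, 1, "block"), ("OP", "=", 1, 3, "block"),
    ("LBRACK", "[", 1, 5, "block"),
    ("NUMBER", "1", 2, 5, "flow"), ("COMMA", ",", 2, 6, "flow"),
    ("NUMBER", "2", 3, 5, "flow"),
    ("RBRACK", "]", 4, 1, "flow"),
    ("IDENT", "y", 5, 1, "block"),
]

blind = layout(TOKENS)                                # mode-blind pass
aware = layout(TOKENS, insignificant_modes={"flow"})  # flow mode ignored
```

The mode-blind result contains a spurious INDENT/DEDENT pair around the list body; the mode-aware result contains none. With an empty `insignificant_modes` the function is exactly the positional pass, which mirrors the first invariant above.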
Honest scope. Level A covers multi-line flow and implicit continuation. Block scalars (| / >) and heredocs need a reference indent carried in the mode frame — that is Level B, a designed next step, not yet built. The bundled grammars demonstrate the features; each examples/<lang>.hpp header documents its own scope.
scilex is a command-line lexer — make cli builds it, make install puts it on your PATH (PREFIX=/BINDIR= to choose where). It has two input modes.
Built-in grammars — a showcase over the nine example languages (JSON, Python, C++, SQL, CSS, Lisp, math, XML, YAML):
Your own grammar — the universal mode: bring a .lex file and lex anything. A grammar is one rule per line — name, a tab, regex, then an optional tab and skip (# comments and blank lines are ignored):
Output is one token per line — the kind, a tab, the lexeme, a tab, then line:col; --layout adds the indentation tokens. A malformed grammar is reported with a clear, positioned error (my.lex:3: invalid regex: …) — never a crash. See examples/sample.lex for a worked file.
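The grammar format and the positioned error described above can be sketched as a small parser. This is not the CLI's code: `parse_lex` and `GrammarError` are hypothetical names, Python's `re` stands in for REAL when validating patterns, and the sketch assumes that `#` comments occupy a whole line and that the optional third field is the literal word `skip`.

```python
import re


class GrammarError(Exception):
    """Raised with a file:line-prefixed message for a malformed grammar."""


def parse_lex(source, filename="my.lex"):
    """Parse 'name<TAB>regex[<TAB>skip]' lines into (name, regex, skip) triples."""
    rules = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Blank lines and whole-line '#' comments are ignored (assumption).
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) not in (2, 3):
            raise GrammarError(
                f"{filename}:{lineno}: expected name<TAB>regex[<TAB>skip]"
            )
        name, pattern = fields[0], fields[1]
        skip = False
        if len(fields) == 3:
            if fields[2].strip() != "skip":
                raise GrammarError(
                    f"{filename}:{lineno}: unknown flag {fields[2]!r}"
                )
            skip = True
        try:
            re.compile(pattern)  # stand-in validation; SciLex uses REAL
        except re.error as exc:
            raise GrammarError(
                f"{filename}:{lineno}: invalid regex: {exc}"
            ) from None
        rules.append((name, pattern, skip))
    return rules
```

The error path never lets a regex exception escape unpositioned: a bad pattern on line 3 surfaces as a message starting with `my.lex:3: invalid regex:`.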
This .lex format is a tool convenience parsed by the CLI; the library itself stays plain C++ rule lists (std::vector<scilex::rule>) — no spec language is embedded.
SciLex is header-only and depends only on REAL's headers (the package real-regex on PyPI / https://github.com/RECHE23/real-regex).
By default the build looks for them in a sibling checkout:
Point the build elsewhere with REAL_INCLUDE (Makefile) or -DSCILEX_REAL_INCLUDE=... (CMake) — for instance at the path printed by python -c "import real; print(real.get_include())" when REAL is installed via pip.
For CI or a reproducible build — where no on-disk layout can be assumed — fetch REAL with CMake FetchContent instead (make build FETCH=1, or -DSCILEX_FETCH_DEPS=ON); point it at a remote and pin a tag with -DSCILEX_REAL_REPO=https://… -DSCILEX_REAL_TAG=v2026.7.5.
The API reference is published at https://reche23.github.io/scilex/.
Override the compiler with make test CXX=g++-14.
Coverage bar. SciLex holds the SciLang-stack gate — 100% on all four dimensions (lines, functions, regions and branches) of include/, checked by make coverage and enforced by make full-local-gate (using Apple clang 16). The published report on GitHub Pages / the doc tarball (built on clang 18) reads mid-90s (newer clang instruments more branches). This is the documented toolchain distinction; see the live report for exact figures. (REAL is the other documented exception to the 100% gate — see its README.)
scilex::scilex is the CMake target — add_subdirectory, FetchContent, or an installed config package. The config calls find_dependency(real), so installing REAL's config package alongside (on the same prefix) makes the whole chain resolve from one find_package:
make release computes the next calendar version YYYY.M.PATCH (the patch resets each month; PEP 440 drops leading zeros). The pushed tag drives the release workflow — wheels + sdist + the API-reference tarball + a GitHub Release, published via Trusted Publishing — while docs.yml deploys the reference to GitHub Pages.
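The calendar-version rule can be sketched as follows. `next_version` is a hypothetical helper, not the script behind `make release`, and it assumes the patch restarts at 0 in a new month; the source only says that it resets.

```python
import datetime


def next_version(latest, today):
    """Return the next YYYY.M.PATCH version.

    latest: previous version string such as '2026.7.5', or None if there is none.
    today:  a datetime.date; its year and month form the first two fields.
    """
    year, month = today.year, today.month
    if latest is not None:
        prev_year, prev_month, prev_patch = (int(p) for p in latest.split("."))
        if (prev_year, prev_month) == (year, month):
            # Same calendar month: bump the patch.
            return f"{year}.{month}.{prev_patch + 1}"
    # New month (or first release): patch resets. Starting at 0 is an assumption.
    # Formatting ints directly gives no leading zeros, as PEP 440 normalizes.
    return f"{year}.{month}.0"
```

For example, a release on 20 July 2026 after `2026.7.5` yields `2026.7.6`, while the first release in August yields `2026.8.0` under the reset-to-zero assumption.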
A guided tour of how SciLex works (maximal munch, REAL foundation, layout, C++/Python API, current scope) lives in docs/design.dox (also rendered by make doc).
On normal input re is faster; on adversarial input SciLex stays linear while re explodes. See BENCHMARKS.md for the figures.
MIT — see [LICENSE](LICENSE).
René Chenard