MarkQL CLI Guide: Intro to Advanced
This guide explains why MarkQL exists, how to use it from the CLI, and how to move from basic to advanced workflows.
Why MarkQL?
MarkQL is useful when you want SQL-like querying over HTML without building a custom scraper for every page shape.
Use MarkQL when you need to:
- Inspect and extract structured data from static HTML quickly.
- Filter elements by attributes, hierarchy, and text rules.
- Iterate fast in terminal/REPL before writing production automation.
- Export results in machine-friendly formats (JSON list, table rows, CSV, Parquet).
Core Mental Model
MarkQL treats HTML elements as rows in a node table.
Each row has core fields:
node_idtagattributesparent_idsibling_posmax_depthdoc_ordersource_uri
Think of it as:
SELECT <tag or projected fields>FROM <html source>WHERE <filters>- optional
LIMIT,TO LIST,TO TABLE,TO CSV,TO PARQUET,TO JSON,TO NDJSON
For PROJECT(...), keep this exact mental model:
PROJECT(base_tag)chooses row candidates by tag (PROJECT(document)behaves like all tags).- Outer
WHEREfilters those row candidates. - Field predicates inside
PROJECT(... AS (...))choose which row-scoped node provides each field value. - Row scope for field extraction is the row node plus its descendants.
Short version:
PROJECT picks candidates, outer WHERE filters rows, field WHERE picks values.
Deep explanation:
CLI Setup
Build:
./build.sh
Run one query:
./build/markql --query "SELECT a FROM doc WHERE href CONTAINS 'https'" --input ./data/index.html
Run REPL:
./build/markql --interactive --input ./data/index.html
Lint query (parse + validate only, no execution):
./build/markql --lint "SELECT FROM doc"
Force colorized human diagnostics:
./build/markql --lint "SELECT FROM doc" --color=always
Lint as JSON:
./build/markql --lint "SELECT FROM doc" --format json
Check version + provenance:
./build/markql --version
Compatibility note:
./build/xsqlremains available as a legacy command name.
Diagnostics and Lint
Use lint mode to validate syntax + key semantic rules without loading/executing data:
./build/markql --lint "SELECT TEXT(node_div) FROM doc AS node_div WHERE node_div.tag = 'div'"
Default lint output provides one diagnostic block per issue:
- severity and stable
code - caret-positioned snippet
help:fix guidance
Lint text color controls:
--color=always: always emit ANSI colors for human lint text--color=auto: emit ANSI colors only on a TTY--color=never(or--color=disabled): force plain textNO_COLORoverrides color and forces plain text
--format json is always uncolored for automation stability.
SELECT <from_alias> now emits a warning (MQL-LINT-0001) because alias-as-value
is ambiguous; use SELECT self for current-row node projection.
doc_ref remains available in JSON diagnostics output.
JSON format is available for automation:
./build/markql --lint "SELECT FROM doc" --format json
Exit codes:
0no ERROR diagnostics1one or more ERROR diagnostics2CLI/tooling failure
Normal query execution uses the same diagnostic formatting for invalid queries (instead of raw exception text).
Version and Provenance
CLI version output includes provenance:
- semantic version
- git commit hash
-dirtysuffix when built from a dirty worktree- version source of truth:
python/xsql/_meta.py(__version__)
Python exposes the same core provenance:
import xsql
print(xsql.__version__)
print(xsql.core_version())
print(xsql.core_version_info())
Fast Start: 5 Queries
SELECT div FROM doc LIMIT 5;
SELECT node_link FROM doc AS node_link WHERE node_link.href CONTAINS 'https';
SELECT node_link.href FROM doc AS node_link WHERE node_link.rel = 'preload' TO LIST();
SELECT table FROM doc TO TABLE();
SELECT COUNT(a) FROM doc;
Sources
Use different input sources:
SELECT div FROM document;
SELECT div FROM 'page.html';
SELECT div FROM 'https://example.com';
SELECT div FROM RAW('<div class="x">hello</div>');
SELECT li FROM PARSE('<ul><li>1</li><li>2</li></ul>') AS node_fragment;
Alias sources:
SELECT node_doc FROM document AS node_doc WHERE node_doc.id = 'login';
Aliases and fields
Use alias.field as the primary row-reference style.
FROM doc binds an implicit row alias named doc:
SELECT doc.node_id, doc.tag
FROM doc
WHERE doc.tag = 'div';
You can still bind an explicit alias:
SELECT node_div.node_id, node_div.tag
FROM doc AS node_div
WHERE node_div.tag = 'div';
Rules:
- Once you alias
doc(for exampleFROM doc AS node_div), use only that alias in row references. doc.*is not bound in that scope.- Recommended style (not required by the language):
node_<semantic>for DOM node rows.r_<semantic>for CTE/derived logical rows.
PARSE(...) accepts either:
- an HTML string expression
- a subquery returning HTML strings (single projected column)
Example:
SELECT li
FROM PARSE(
SELECT inner_html(div, 2)
FROM doc
WHERE attributes.class = 'pagination'
) AS node_fragment;
Compatibility note:
FRAGMENTS(...)still works but is deprecated.- Migration:
FRAGMENTS(x)->PARSE(x).
WITH, JOIN, and LATERAL
MarkQL supports SQL-style CTEs and joins with deterministic row order.
WITH ...defines statement-local relations.JOIN/LEFT JOINuseON ....CROSS JOINis cartesian and does not takeON.CROSS JOIN LATERAL (...) AS node_rightruns the right subquery per left row (flatMap behavior).
Efficient baseline pattern (filtered rows + lateral expansion + selective left joins):
WITH r_rows AS (
SELECT node_row.node_id AS row_id
FROM doc AS node_row
WHERE node_row.tag = 'tr' AND EXISTS(child WHERE tag = 'td')
),
r_cells AS (
SELECT
r_row.row_id,
node_cell.sibling_pos AS pos,
TEXT(node_cell) AS val
FROM r_rows AS r_row
CROSS JOIN LATERAL (
SELECT node_cell
FROM doc AS node_cell
WHERE node_cell.parent_id = r_row.row_id
AND node_cell.tag = 'td'
) AS node_cell
)
SELECT
r_row.row_id,
r_cell_2.val AS item_id,
r_cell_4.val AS item_name
FROM r_rows AS r_row
LEFT JOIN r_cells AS r_cell_2 ON r_cell_2.row_id = r_row.row_id AND r_cell_2.pos = 2
LEFT JOIN r_cells AS r_cell_4 ON r_cell_4.row_id = r_row.row_id AND r_cell_4.pos = 4
ORDER BY r_row.row_id;
Derived-table source form is also supported:
SELECT r_rows.row_id
FROM (
SELECT node_row.node_id AS row_id
FROM doc AS node_row
WHERE node_row.tag = 'tr'
) AS r_rows
ORDER BY r_rows.row_id;
Filtering with WHERE
Basic operators:
=<>/!=<,<=,>,>=IN (...)LIKE(%any sequence,_one character)IS NULL/IS NOT NULL~regexCONTAINS,CONTAINS ALL,CONTAINS ANY(attributes)HAS_DIRECT_TEXT(legacy operator shorthand)EXISTS(axis [WHERE expr])
Examples:
SELECT div FROM doc WHERE id = 'main';
SELECT node_link FROM doc AS node_link WHERE node_link.href IN ('/a','/b');
SELECT node_link FROM doc AS node_link WHERE node_link.href ~ '.*\.pdf$';
SELECT div FROM doc WHERE text LIKE '%coupon%';
SELECT div FROM doc WHERE POSITION('coupon' IN LOWER(text)) > 0;
SELECT div FROM doc WHERE attributes IS NULL;
SELECT div FROM doc WHERE DIRECT_TEXT(div) LIKE '%buy now%';
SELECT div FROM doc WHERE EXISTS(child);
SELECT div FROM doc WHERE EXISTS(child WHERE tag = 'h2');
SQL-style direct text form (preferred over HAS_DIRECT_TEXT):
SELECT div FROM doc WHERE DIRECT_TEXT(div) LIKE '%buy now%';
Current behavior note:
LIKEmatching is ASCII case-insensitive in this release.
Reserved keywords used by these features:
LIKECONCATSUBSTRING/SUBSTRLENGTH/CHAR_LENGTHPOSITION/LOCATEREPLACELOWER/UPPERLTRIM/RTRIMDIRECT_TEXTCASE/WHEN/THEN/ELSE/END
Hierarchy (Axes)
Axes let you filter by relationships:
parentchildancestordescendant
Examples:
SELECT span FROM doc WHERE parent.tag = 'div';
SELECT div FROM doc WHERE descendant.attributes.data-testid = 'review-text';
SELECT a FROM doc WHERE ancestor.id = 'content';
Important parser detail for axis attributes:
- Use
child.attr.foo,parent.attr.foo,descendant.attr.foo, etc. attris a shorthand alias ofattributes(child.attributes.foostill works).- Prefer
attrin new queries for shorter, consistent syntax. - In this branch, shorthand like
child.foomay fail parse.
EXISTS predicate:
- Syntax:
EXISTS(self|parent|child|ancestor|descendant [WHERE <expr>]) EXISTS(axis)checks whether at least one node exists on that axis.EXISTS(axis WHERE ...)evaluates<expr>on each axis node and returns true if any one node matches.- Conditions inside
EXISTS(... WHERE ...)are applied to the same axis node.
Examples:
SELECT div FROM doc WHERE EXISTS(child);
SELECT div FROM doc WHERE EXISTS(child WHERE tag = 'h2');
SELECT div FROM doc WHERE EXISTS(child WHERE tag = 'span' AND attributes.class = 'price');
Projections
Project fields:
SELECT link.href FROM doc;
SELECT div(node_id, tag, parent_id) FROM doc;
Project functions:
SELECT inner_html(div) FROM doc WHERE id = 'card';
SELECT raw_inner_html(div) FROM doc WHERE id = 'card';
SELECT trim(inner_html(div)) FROM doc WHERE id = 'card';
SELECT text(div) FROM doc WHERE attributes.class = 'summary';
SELECT lower(replace(trim(text(div)), ' ', '-')) AS slug FROM doc WHERE attributes.class = 'summary';
Notes:
TEXT()andINNER_HTML()require aWHEREwith a non-tag filter.INNER_HTML()is minified by default; useRAW_INNER_HTML()for raw spacing.LENGTH()/CHAR_LENGTH()currently count UTF-8 bytes.
SQL String Functions
Available in SELECT, WHERE, and inside PROJECT(...) expressions:
CONCAT(a, b, ...)SUBSTRING(str, start, len)andSUBSTR(...)LENGTH(str)andCHAR_LENGTH(str)(UTF-8 byte length)POSITION(substr IN str)andLOCATE(substr, str[, start])REPLACE(str, from, to)LOWER(str),UPPER(str)LTRIM(str),RTRIM(str),TRIM(str)DIRECT_TEXT(tag)
Examples:
SELECT CONCAT(attributes.class, '-x') AS label
FROM doc
WHERE tag = 'div';
SELECT SUBSTRING(TRIM(TEXT(div)), 1, 10) AS preview
FROM doc
WHERE attributes.id = 'card';
FLATTEN_TEXT / FLATTEN
FLATTEN_TEXT is for extracting ordered text slices from descendant nodes.
SELECT FLATTEN_TEXT(div) AS (date, body)
FROM doc
WHERE attributes.class = 'review'
AND descendant.attributes.data-testid CONTAINS ANY ('review-date', 'review-text');
Alias:
SELECT FLATTEN(div) AS (value) FROM doc WHERE descendant.tag = 'span';
Common mistakes:
FLATTENT(...)is invalid. UseFLATTEN_TEXT(...)orFLATTEN(...).- If validation requires aliases, use
AS (col1, col2, ...).
PROJECT
PROJECT is for stable field extraction per base row using expression mapping.
Supported expression forms:
TEXT(tag WHERE <predicate>)ATTR(tag, attr WHERE <predicate>)TEXT(..., <n>)/ATTR(..., <n>)for 1-based stable selectionFIRST_TEXT(...),LAST_TEXT(...),FIRST_ATTR(...),LAST_ATTR(...)COALESCE(expr1, expr2, ...)DIRECT_TEXT(tag [WHERE <predicate>])CASE WHEN <boolean_expr> THEN <value_expr> [ELSE <value_expr>] END- SQL string functions (for example
LOWER(REPLACE(TRIM(TEXT(h2)), ' ', '-'))) - Alias references to previous fields in the same
AS (...)block - Top-level comparisons on expressions (for example
POSITION('coupon' IN LOWER(TEXT(li))) > 0)
Example:
SELECT tr.node_id,
PROJECT(tr) AS (
period: TEXT(td WHERE sibling_pos = 1),
pdf_direct: COALESCE(
ATTR(a, href WHERE parent.sibling_pos = 3 AND href CONTAINS '.pdf'),
TEXT(td WHERE sibling_pos = 3)
),
excel_direct: COALESCE(
ATTR(a, href WHERE parent.sibling_pos = 5 AND href CONTAINS '.xlsx'),
TEXT(td WHERE sibling_pos = 5)
)
)
FROM doc
WHERE EXISTS(child WHERE tag = 'td');
Notes:
AS (...)is required and must usealias: expression.COALESCEreturns the first non-NULL, non-blank extracted value.- Prefer
DIRECT_TEXT(td) LIKE '%2025%'as the default direct-text filter form. HAS_DIRECT_TEXTremains available as operator shorthand (td HAS_DIRECT_TEXT '2025').- Selector indexes are 1-based (
TEXT(..., 2)is the second match). Out-of-range indexes returnNULL. FLATTEN_EXTRACT(...)is kept as a compatibility alias.- Fields are evaluated left-to-right; later aliases can reference earlier ones.
Output Modes
TO LIST
Single projected column to JSON list:
SELECT a.href FROM doc WHERE href IS NOT NULL TO LIST();
TO TABLE
Extract HTML table rows. Default behavior is backward compatible: without new options, output matches prior rectangular table output.
SELECT table FROM doc TO TABLE();
SELECT table FROM doc TO TABLE(HEADER=OFF);
SELECT table FROM doc WHERE id = 'stats' TO TABLE(EXPORT='stats.csv');
Options:
HEADER=ON|OFF(defaultON)NOHEADER/NO_HEADER(same asHEADER=OFF)HEADER_NORMALIZE=ON|OFF(defaultONwhenHEADER=ON)TRIM_EMPTY_ROWS=OFF|ON(defaultOFF)TRIM_EMPTY_COLS=OFF|TRAILING|ALL(defaultOFF)EMPTY_IS=BLANK_OR_NULL|NULL_ONLY|BLANK_ONLY(defaultBLANK_OR_NULL)STOP_AFTER_EMPTY_ROWS=<int>(default0, disabled)FORMAT=RECT|SPARSE(defaultRECT)SPARSE_SHAPE=LONG|WIDE(defaultLONG, only whenFORMAT=SPARSE)EXPORT='path.csv'(single-table CSV export)
Semantics:
TRIM_EMPTY_ROWS=ON: drop rows where every cell is empty underEMPTY_IS.TRIM_EMPTY_COLS=OFF: keep all columns.TRIM_EMPTY_COLS=TRAILING: drop only right-edge empty columns.TRIM_EMPTY_COLS=ALL: drop all columns empty across remaining rows (including internal empty columns).EMPTY_IS=BLANK_OR_NULL: empty means blank text or missing/unextractable cell.EMPTY_IS=NULL_ONLY: empty means missing/unextractable cell only.EMPTY_IS=BLANK_ONLY: empty means normalized blank text only.STOP_AFTER_EMPTY_ROWS=N(N > 0): stop output afterNconsecutive all-empty rows.FORMAT=RECT: rectangular rows/cells (default).FORMAT=SPARSE, SPARSE_SHAPE=LONG: emit one row per non-empty cell withrow_index,col_index, optionalheader(whenHEADER=ON), andvalue.FORMAT=SPARSE, SPARSE_SHAPE=WIDE: emit one object per table row with only non-empty keys; keys use normalized header whenHEADER=ON, elsecol_<n>.HEADER_NORMALIZE=ON: trim/collapse whitespace, remove duplicate adjacent tokens, keep Unicode, fallback tocol_<n>when result is empty.
Trim example:
SELECT table FROM doc
TO TABLE(TRIM_EMPTY_ROWS=ON, TRIM_EMPTY_COLS=TRAILING);
Sparse LONG example:
SELECT table FROM doc
TO TABLE(FORMAT=SPARSE, SPARSE_SHAPE=LONG, HEADER=ON);
Sparse WIDE example:
SELECT table FROM doc
TO TABLE(FORMAT=SPARSE, SPARSE_SHAPE=WIDE, HEADER=ON);
Determinism:
- Same DOM snapshot + same query/options => byte-identical output.
TO CSV / TO PARQUET
SELECT a.href, TEXT(a) FROM doc WHERE href IS NOT NULL TO CSV('links.csv');
SELECT * FROM doc TO PARQUET('nodes.parquet');
By default, exported column names are normalized to identifier-safe names
(for example data-id -> data_id).
TO JSON / TO NDJSON
SELECT a.href, TEXT(a) FROM doc WHERE href IS NOT NULL TO JSON('links.json');
SELECT a.href, TEXT(a) FROM doc WHERE href IS NOT NULL TO NDJSON('links.ndjson');
Both also accept empty destination (TO JSON() / TO NDJSON()) to stream to stdout.
REPL Workflow
Useful commands:
.help.load <path|url> [--alias <name>].mode duckbox|json|plain.set colnames raw|normalize.lint on|off.display_mode more|less.max_rows <n|inf>DESCRIBE LAST.summarize [doc|alias|path|url].reload_config.quit
Column-name modes:
normalize(default): use identifier-safe output headers/keys.raw: keep original projected names.DESCRIBE LAST: showraw_nameandoutput_namefor the previous query.
Vim navigation mode:
- Default editor mode is normal.
- Press
Escto switch into Vim normal mode. - In Vim mode, press
Escto toggle betweenvim:editandvim:normal, and back to normal mode. - Prompts:
- normal:
markql> - vim normal:
markql (vim:normal)> - vim edit:
markql (vim:edit) >(padded to keep width aligned)
- normal:
- Vim keys:
h/j/k/l,i/a/I/A,o/O.
Great iterative pattern:
.loadinput- Start with
SELECT * ... LIMIT 5 - Add
WHEREfilters - Add projections/functions
- Export with
TO LIST/TO CSV/TO TABLE
Practical Advanced Use Cases
1) Review extraction
SELECT div.data-review-id, FLATTEN_TEXT(div) AS (review_text)
FROM doc
WHERE attributes.class = 'review'
AND descendant.attributes.data-testid = 'review-text';
2) Navigation audit
SELECT a.href, TEXT(a)
FROM doc
WHERE ancestor.id = 'navbar' AND href IS NOT NULL
TO CSV('nav_links.csv');
3) Content block quality checks
SELECT section
FROM doc
WHERE attributes.class CONTAINS 'content'
AND descendant.tag IN ('h1','h2','p');
4) Table extraction for analytics
SELECT table FROM doc WHERE id = 'report' TO TABLE(EXPORT='report.csv');
Troubleshooting
If you get parse errors:
Expected FROM: missingFROM ....Expected attributes, tag, text... after child: usechild.attributes.<name>.FLATTEN_TEXT() requires AS (...): add column aliases.TEXT()/INNER_HTML() requires a non-tag filter: add attribute/parent style predicate.
If no rows return:
- Check whether attribute names are exact (
data-testidvsdata-test-id). - Use broader filter first (
CONTAINS) then narrow. - Test with
LIMITand simpler predicates.
Self-Discovery Commands
Use built-in metadata queries while learning:
SHOW FUNCTIONS;
SHOW AXES;
SHOW OPERATORS;
DESCRIBE doc;
DESCRIBE language;