Skip to the content.

MarkQL CLI Guide: Intro to Advanced

This guide explains why MarkQL exists, how to use it from the CLI, and how to move from basic to advanced workflows.

Why MarkQL?

MarkQL is useful when you want SQL-like querying over HTML without building a custom scraper for every page shape.

Use MarkQL when you need to:

Core Mental Model

MarkQL treats HTML elements as rows in a node table.

Each row has core fields:

Think of it as:

For PROJECT(...), keep this exact mental model:

Short version:

PROJECT picks candidates, outer WHERE filters rows, field WHERE picks values.

Deep explanation:

CLI Setup

Build:

./build.sh

Run one query:

./build/markql --query "SELECT a FROM doc WHERE href CONTAINS 'https'" --input ./data/index.html

Run REPL:

./build/markql --interactive --input ./data/index.html

Lint query (parse + validate only, no execution):

./build/markql --lint "SELECT FROM doc"

Force colorized human diagnostics:

./build/markql --lint "SELECT FROM doc" --color=always

Lint as JSON:

./build/markql --lint "SELECT FROM doc" --format json

Check version + provenance:

./build/markql --version

Compatibility note:

Diagnostics and Lint

Use lint mode to validate syntax + key semantic rules without loading/executing data:

./build/markql --lint "SELECT TEXT(node_div) FROM doc AS node_div WHERE node_div.tag = 'div'"

Default lint output provides one diagnostic block per issue:

Lint text color controls:

--format json is always uncolored for automation stability.

SELECT <from_alias> now emits a warning (MQL-LINT-0001) because alias-as-value is ambiguous; use SELECT self for current-row node projection.

doc_ref remains available in JSON diagnostics output.

JSON format is available for automation:

./build/markql --lint "SELECT FROM doc" --format json

Exit codes:

Normal query execution uses the same diagnostic formatting for invalid queries (instead of raw exception text).

Version and Provenance

CLI version output includes provenance:

Python exposes the same core provenance:

import xsql
print(xsql.__version__)
print(xsql.core_version())
print(xsql.core_version_info())

Fast Start: 5 Queries

SELECT div FROM doc LIMIT 5;
SELECT node_link FROM doc AS node_link WHERE node_link.href CONTAINS 'https';
SELECT node_link.href FROM doc AS node_link WHERE node_link.rel = 'preload' TO LIST();
SELECT table FROM doc TO TABLE();
SELECT COUNT(a) FROM doc;

Sources

Use different input sources:

SELECT div FROM document;
SELECT div FROM 'page.html';
SELECT div FROM 'https://example.com';
SELECT div FROM RAW('<div class="x">hello</div>');
SELECT li FROM PARSE('<ul><li>1</li><li>2</li></ul>') AS node_fragment;

Alias sources:

SELECT node_doc FROM document AS node_doc WHERE node_doc.id = 'login';

Aliases and fields

Use alias.field as the primary row-reference style.

FROM doc binds an implicit row alias named doc:

SELECT doc.node_id, doc.tag
FROM doc
WHERE doc.tag = 'div';

You can still bind an explicit alias:

SELECT node_div.node_id, node_div.tag
FROM doc AS node_div
WHERE node_div.tag = 'div';

Rules:

PARSE(...) accepts either:

Example:

SELECT li
FROM PARSE(
  SELECT inner_html(div, 2)
  FROM doc
  WHERE attributes.class = 'pagination'
) AS node_fragment;

Compatibility note:

WITH, JOIN, and LATERAL

MarkQL supports SQL-style CTEs and joins with deterministic row order.

Efficient baseline pattern (filtered rows + lateral expansion + selective left joins):

WITH r_rows AS (
  SELECT node_row.node_id AS row_id
  FROM doc AS node_row
  WHERE node_row.tag = 'tr' AND EXISTS(child WHERE tag = 'td')
),
r_cells AS (
  SELECT
    r_row.row_id,
    node_cell.sibling_pos AS pos,
    TEXT(node_cell) AS val
  FROM r_rows AS r_row
  CROSS JOIN LATERAL (
    SELECT node_cell
    FROM doc AS node_cell
    WHERE node_cell.parent_id = r_row.row_id
      AND node_cell.tag = 'td'
  ) AS node_cell
)
SELECT
  r_row.row_id,
  r_cell_2.val AS item_id,
  r_cell_4.val AS item_name
FROM r_rows AS r_row
LEFT JOIN r_cells AS r_cell_2 ON r_cell_2.row_id = r_row.row_id AND r_cell_2.pos = 2
LEFT JOIN r_cells AS r_cell_4 ON r_cell_4.row_id = r_row.row_id AND r_cell_4.pos = 4
ORDER BY r_row.row_id;

Derived-table source form is also supported:

SELECT r_rows.row_id
FROM (
  SELECT node_row.node_id AS row_id
  FROM doc AS node_row
  WHERE node_row.tag = 'tr'
) AS r_rows
ORDER BY r_rows.row_id;

Filtering with WHERE

Basic operators:

Examples:

SELECT div FROM doc WHERE id = 'main';
SELECT node_link FROM doc AS node_link WHERE node_link.href IN ('/a','/b');
SELECT node_link FROM doc AS node_link WHERE node_link.href ~ '.*\.pdf$';
SELECT div FROM doc WHERE text LIKE '%coupon%';
SELECT div FROM doc WHERE POSITION('coupon' IN LOWER(text)) > 0;
SELECT div FROM doc WHERE attributes IS NULL;
SELECT div FROM doc WHERE DIRECT_TEXT(div) LIKE '%buy now%';
SELECT div FROM doc WHERE EXISTS(child);
SELECT div FROM doc WHERE EXISTS(child WHERE tag = 'h2');

SQL-style direct text form (preferred over HAS_DIRECT_TEXT):

SELECT div FROM doc WHERE DIRECT_TEXT(div) LIKE '%buy now%';

Current behavior note:

Reserved keywords used by these features:

Hierarchy (Axes)

Axes let you filter by relationships:

Examples:

SELECT span FROM doc WHERE parent.tag = 'div';
SELECT div FROM doc WHERE descendant.attributes.data-testid = 'review-text';
SELECT a FROM doc WHERE ancestor.id = 'content';

Important parser detail for axis attributes:

EXISTS predicate:

Examples:

SELECT div FROM doc WHERE EXISTS(child);
SELECT div FROM doc WHERE EXISTS(child WHERE tag = 'h2');
SELECT div FROM doc WHERE EXISTS(child WHERE tag = 'span' AND attributes.class = 'price');

Projections

Project fields:

SELECT link.href FROM doc;
SELECT div(node_id, tag, parent_id) FROM doc;

Project functions:

SELECT inner_html(div) FROM doc WHERE id = 'card';
SELECT raw_inner_html(div) FROM doc WHERE id = 'card';
SELECT trim(inner_html(div)) FROM doc WHERE id = 'card';
SELECT text(div) FROM doc WHERE attributes.class = 'summary';
SELECT lower(replace(trim(text(div)), ' ', '-')) AS slug FROM doc WHERE attributes.class = 'summary';

Notes:

SQL String Functions

Available in SELECT, WHERE, and inside PROJECT(...) expressions:

Examples:

SELECT CONCAT(attributes.class, '-x') AS label
FROM doc
WHERE tag = 'div';
SELECT SUBSTRING(TRIM(TEXT(div)), 1, 10) AS preview
FROM doc
WHERE attributes.id = 'card';

FLATTEN_TEXT / FLATTEN

FLATTEN_TEXT is for extracting ordered text slices from descendant nodes.

SELECT FLATTEN_TEXT(div) AS (date, body)
FROM doc
WHERE attributes.class = 'review'
  AND descendant.attributes.data-testid CONTAINS ANY ('review-date', 'review-text');

Alias:

SELECT FLATTEN(div) AS (value) FROM doc WHERE descendant.tag = 'span';

Common mistakes:

PROJECT

PROJECT is for stable field extraction per base row using expression mapping.

Supported expression forms:

Example:

SELECT tr.node_id,
PROJECT(tr) AS (
  period: TEXT(td WHERE sibling_pos = 1),
  pdf_direct: COALESCE(
    ATTR(a, href WHERE parent.sibling_pos = 3 AND href CONTAINS '.pdf'),
    TEXT(td WHERE sibling_pos = 3)
  ),
  excel_direct: COALESCE(
    ATTR(a, href WHERE parent.sibling_pos = 5 AND href CONTAINS '.xlsx'),
    TEXT(td WHERE sibling_pos = 5)
  )
)
FROM doc
WHERE EXISTS(child WHERE tag = 'td');

Notes:

Output Modes

TO LIST

Single projected column to JSON list:

SELECT a.href FROM doc WHERE href IS NOT NULL TO LIST();

TO TABLE

Extract HTML table rows. Default behavior is backward compatible: without new options, output matches prior rectangular table output.

SELECT table FROM doc TO TABLE();
SELECT table FROM doc TO TABLE(HEADER=OFF);
SELECT table FROM doc WHERE id = 'stats' TO TABLE(EXPORT='stats.csv');

Options:

Semantics:

Trim example:

SELECT table FROM doc
TO TABLE(TRIM_EMPTY_ROWS=ON, TRIM_EMPTY_COLS=TRAILING);

Sparse LONG example:

SELECT table FROM doc
TO TABLE(FORMAT=SPARSE, SPARSE_SHAPE=LONG, HEADER=ON);

Sparse WIDE example:

SELECT table FROM doc
TO TABLE(FORMAT=SPARSE, SPARSE_SHAPE=WIDE, HEADER=ON);

Determinism:

TO CSV / TO PARQUET

SELECT a.href, TEXT(a) FROM doc WHERE href IS NOT NULL TO CSV('links.csv');
SELECT * FROM doc TO PARQUET('nodes.parquet');

By default, exported column names are normalized to identifier-safe names (for example data-id -> data_id).

TO JSON / TO NDJSON

SELECT a.href, TEXT(a) FROM doc WHERE href IS NOT NULL TO JSON('links.json');
SELECT a.href, TEXT(a) FROM doc WHERE href IS NOT NULL TO NDJSON('links.ndjson');

Both also accept empty destination (TO JSON() / TO NDJSON()) to stream to stdout.

REPL Workflow

Useful commands:

Column-name modes:

Vim navigation mode:

Great iterative pattern:

  1. .load input
  2. Start with SELECT * ... LIMIT 5
  3. Add WHERE filters
  4. Add projections/functions
  5. Export with TO LIST/TO CSV/TO TABLE

Practical Advanced Use Cases

1) Review extraction

SELECT div.data-review-id, FLATTEN_TEXT(div) AS (review_text)
FROM doc
WHERE attributes.class = 'review'
  AND descendant.attributes.data-testid = 'review-text';

2) Navigation audit

SELECT a.href, TEXT(a)
FROM doc
WHERE ancestor.id = 'navbar' AND href IS NOT NULL
TO CSV('nav_links.csv');

3) Content block quality checks

SELECT section
FROM doc
WHERE attributes.class CONTAINS 'content'
  AND descendant.tag IN ('h1','h2','p');

4) Table extraction for analytics

SELECT table FROM doc WHERE id = 'report' TO TABLE(EXPORT='report.csv');

Troubleshooting

If you get parse errors:

If no rows return:

Self-Discovery Commands

Use built-in metadata queries while learning:

SHOW FUNCTIONS;
SHOW AXES;
SHOW OPERATORS;
DESCRIBE doc;
DESCRIBE language;