Skip to the content.

MarkQL CLI Guide: Intro to Advanced

This guide explains why MarkQL exists, how to use it from the CLI, and how to move from basic to advanced workflows.

Why MarkQL?

MarkQL is useful when you want SQL-like querying over HTML without building a custom scraper for every page shape.

Use MarkQL when you need to:

Core Mental Model

MarkQL treats HTML elements as rows in a node table.

Each row has core fields:

Think of it as:

For PROJECT(...), keep this exact mental model:

Short version:

PROJECT picks candidates, outer WHERE filters rows, field WHERE picks values.

Deep explanation:

CLI Setup

Build:

./scripts/build/build.sh

Build note:

Create a parsed document snapshot once:

./build/markql --input ./data/index.html --write-mqd ./cache/index.mqd

Create a prepared query once:

./build/markql --query "SELECT a.href, TEXT(a) FROM doc WHERE href IS NOT NULL" \
  --write-mqp ./cache/links.mqp

Inspect artifact metadata:

./build/markql --artifact-info ./cache/index.mqd

Run one query:

./build/markql --query "SELECT a FROM doc WHERE href CONTAINS 'https'" --input ./data/index.html

Run REPL:

./build/markql --interactive --input ./data/index.html

Lint query (parse + validate only, no execution):

./build/markql --lint "SELECT FROM doc"

Force colorized human diagnostics:

./build/markql --lint "SELECT FROM doc" --color=always

Lint as JSON:

./build/markql --lint "SELECT FROM doc" --format json

Render a .mql.j2 query file into plain MarkQL and lint the rendered text:

./build/markql \
  --query-file tests/fixtures/render/generic_query.mql.j2 \
  --render j2 \
  --vars tests/fixtures/render/generic_query.toml \
  --lint

Write rendered MarkQL to a file before continuing through lint/execute:

./build/markql \
  --query-file tests/fixtures/render/generic_query.mql.j2 \
  --render j2 \
  --vars tests/fixtures/render/generic_query.toml \
  --rendered-out /tmp/generic_query.mql \
  --lint

Preview rendered MarkQL on stdout:

./build/markql \
  --query-file tests/fixtures/render/generic_query.mql.j2 \
  --render j2 \
  --vars tests/fixtures/render/generic_query.toml \
  --rendered-out -

Check version + provenance:

./build/markql --version

Compatibility note:

Template Query Files

MarkQL supports an opt-in pre-parse render step for templated query files.

MVP contract:

Recommended naming:

Artifact Workflow

Experimental status:

MarkQL can cache two conservative, versioned artifacts:

These artifacts only remove repeated work:

They do not change query semantics. A direct HTML + SQL run and an equivalent .mqd + .mqp run should produce the same rows.

Recommended usage:

./build/markql --input ./data/index.html --write-mqd ./cache/index.mqd
./build/markql --query-file ./queries/links.sql --write-mqp ./cache/links.mqp
./build/markql --query-file ./cache/links.mqp --input ./cache/index.mqd

Compatibility rules for the MVP:

Current limitations:

Security / trust model:

Prepared-query semantic boundary:

Benchmark snapshot on examples/html/koku_tk.html:

Diagnostics and Lint

Use lint mode to validate syntax + key semantic rules without loading/executing data:

./build/markql --lint "SELECT TEXT(node_div) FROM doc AS node_div WHERE node_div.tag = 'div'"

Default lint output provides one diagnostic block per issue:

For parse failures, lint now prefers local repair guidance when it can prove the failing context. Examples:

Clean lint output is intentionally cautious:

Lint text color controls:

--format json is always uncolored for automation stability.

When linting a templated query file:

--format json now emits a top-level lint result object:

summary includes:

Each diagnostics[] entry keeps the existing diagnostic fields and now includes richer editor-friendly fields:

SELECT <from_alias> now emits a warning (MQL-LINT-0001) because alias-as-value is ambiguous; use SELECT self for current-row node projection.

Suspicious qualified member accesses can also emit warnings when MarkQL accepts them as dynamic attribute syntax but they look like likely mistakes, for example a near-miss built-in field such as alias.tagy.

doc_ref remains available in JSON diagnostics output.

JSON format is available for automation:

./build/markql --lint "SELECT FROM doc" --format json

Exit codes:

Normal query execution uses the same diagnostic formatting for invalid queries (instead of raw exception text).

Version and Provenance

CLI version output includes provenance:

Python exposes the same core provenance:

import markql
print(markql.__version__)
print(markql.core_version())
print(markql.core_version_info())

Artifact files also record producer version metadata. --artifact-info shows:

Fast Start: 5 Queries

SELECT div FROM doc LIMIT 5;
SELECT node_link FROM doc AS node_link WHERE node_link.href CONTAINS 'https';
SELECT node_link.href FROM doc AS node_link WHERE node_link.rel = 'preload' TO LIST();
SELECT table FROM doc TO TABLE();
SELECT COUNT(a) FROM doc;

Sources

Use different input sources:

SELECT div FROM document;
SELECT div FROM 'page.html';
SELECT div FROM 'https://example.com';
SELECT div FROM RAW('<div class="x">hello</div>');
SELECT li FROM PARSE('<ul><li>1</li><li>2</li></ul>') AS node_fragment;

For repeated CLI runs, you can also treat a .mqd snapshot as the --input source:

./build/markql --query "SELECT a.href FROM doc WHERE href IS NOT NULL" \
  --input ./cache/index.mqd

Alias sources:

SELECT node_doc FROM document AS node_doc WHERE node_doc.id = 'login';

Aliases and fields

Use alias.field as the primary row-reference style.

FROM doc binds an implicit row alias named doc:

SELECT doc.node_id, doc.tag
FROM doc
WHERE doc.tag = 'div';

You can still bind an explicit alias:

SELECT node_div.node_id, node_div.tag
FROM doc AS node_div
WHERE node_div.tag = 'div';

Rules:

PARSE(...) accepts either:

Example:

SELECT li
FROM PARSE(
  SELECT inner_html(div, 2)
  FROM doc
  WHERE attributes.class = 'pagination'
) AS node_fragment;

Compatibility note:

WITH, JOIN, and LATERAL

MarkQL supports SQL-style CTEs and joins with deterministic row order.

Efficient baseline pattern (filtered rows + lateral expansion + selective left joins):

WITH r_rows AS (
  SELECT node_row.node_id AS row_id
  FROM doc AS node_row
  WHERE node_row.tag = 'tr' AND EXISTS(child WHERE tag = 'td')
),
r_cells AS (
  SELECT
    r_row.row_id,
    node_cell.sibling_pos AS pos,
    TEXT(node_cell) AS val
  FROM r_rows AS r_row
  CROSS JOIN LATERAL (
    SELECT node_cell
    FROM doc AS node_cell
    WHERE node_cell.parent_id = r_row.row_id
      AND node_cell.tag = 'td'
  ) AS node_cell
)
SELECT
  r_row.row_id,
  r_cell_2.val AS item_id,
  r_cell_4.val AS item_name
FROM r_rows AS r_row
LEFT JOIN r_cells AS r_cell_2 ON r_cell_2.row_id = r_row.row_id AND r_cell_2.pos = 2
LEFT JOIN r_cells AS r_cell_4 ON r_cell_4.row_id = r_row.row_id AND r_cell_4.pos = 4
ORDER BY r_row.row_id;

Derived-table source form is also supported:

SELECT r_rows.row_id
FROM (
  SELECT node_row.node_id AS row_id
  FROM doc AS node_row
  WHERE node_row.tag = 'tr'
) AS r_rows
ORDER BY r_rows.row_id;

Filtering with WHERE

Basic operators:

Examples:

SELECT div FROM doc WHERE id = 'main';
SELECT node_link FROM doc AS node_link WHERE node_link.href IN ('/a','/b');
SELECT node_link FROM doc AS node_link WHERE node_link.href ~ '.*\.pdf$';
SELECT div FROM doc WHERE text LIKE '%coupon%';
SELECT div FROM doc WHERE POSITION('coupon' IN LOWER(text)) > 0;
SELECT div FROM doc WHERE attributes IS NULL;
SELECT div FROM doc WHERE DIRECT_TEXT(div) LIKE '%buy now%';
SELECT div FROM doc WHERE EXISTS(child);
SELECT div FROM doc WHERE EXISTS(child WHERE tag = 'h2');

SQL-style direct text form (preferred over HAS_DIRECT_TEXT):

SELECT div FROM doc WHERE DIRECT_TEXT(div) LIKE '%buy now%';

Current behavior note:

Reserved keywords used by these features:

Hierarchy (Axes)

Axes let you filter by relationships:

Examples:

SELECT span FROM doc WHERE parent.tag = 'div';
SELECT div FROM doc WHERE descendant.attributes.data-testid = 'review-text';
SELECT a FROM doc WHERE ancestor.id = 'content';

Important parser detail for axis attributes:

EXISTS predicate:

Examples:

SELECT div FROM doc WHERE EXISTS(child);
SELECT div FROM doc WHERE EXISTS(child WHERE tag = 'h2');
SELECT div FROM doc WHERE EXISTS(child WHERE tag = 'span' AND attributes.class = 'price');

Projections

Project fields:

SELECT link.href FROM doc;
SELECT div(node_id, tag, parent_id) FROM doc;

Project functions:

SELECT inner_html(div) FROM doc WHERE id = 'card';
SELECT raw_inner_html(div) FROM doc WHERE id = 'card';
SELECT trim(inner_html(div)) FROM doc WHERE id = 'card';
SELECT text(div) FROM doc WHERE attributes.class = 'summary';
SELECT lower(replace(trim(text(div)), ' ', '-')) AS slug FROM doc WHERE attributes.class = 'summary';

Notes:

SQL String Functions

Available in SELECT, WHERE, and inside PROJECT(...) expressions:

Examples:

SELECT CONCAT(attributes.class, '-x') AS label
FROM doc
WHERE tag = 'div';
SELECT SUBSTRING(TRIM(TEXT(div)), 1, 10) AS preview
FROM doc
WHERE attributes.id = 'card';
SELECT REGEX_REPLACE(TRIM(TEXT(div)), '[^0-9]', '') AS digits
FROM doc
WHERE attributes.class = 'price';

FLATTEN_TEXT / FLATTEN

FLATTEN_TEXT is for extracting ordered text slices from descendant nodes.

SELECT FLATTEN_TEXT(div) AS (date, body)
FROM doc
WHERE attributes.class = 'review'
  AND descendant.attributes.data-testid CONTAINS ANY ('review-date', 'review-text');

Alias:

SELECT FLATTEN(div) AS (value) FROM doc WHERE descendant.tag = 'span';

Common mistakes:

PROJECT

PROJECT is for stable field extraction per base row using expression mapping.

Supported expression forms:

Example:

SELECT tr.node_id,
PROJECT(tr) AS (
  period: TEXT(td WHERE sibling_pos = 1),
  pdf_direct: COALESCE(
    ATTR(a, href WHERE parent.sibling_pos = 3 AND href CONTAINS '.pdf'),
    TEXT(td WHERE sibling_pos = 3)
  ),
  excel_direct: COALESCE(
    ATTR(a, href WHERE parent.sibling_pos = 5 AND href CONTAINS '.xlsx'),
    TEXT(td WHERE sibling_pos = 5)
  )
)
FROM doc
WHERE EXISTS(child WHERE tag = 'td');

Notes:

Output Modes

TO LIST

Single projected column to JSON list:

SELECT a.href FROM doc WHERE href IS NOT NULL TO LIST();

TO TABLE

Extract HTML table rows. Default behavior is backward compatible: without new options, output matches prior rectangular table output.

SELECT table FROM doc TO TABLE();
SELECT table FROM doc TO TABLE(HEADER=OFF);
SELECT table FROM doc WHERE id = 'stats' TO TABLE(EXPORT='stats.csv');

Options:

Semantics:

Trim example:

SELECT table FROM doc
TO TABLE(TRIM_EMPTY_ROWS=ON, TRIM_EMPTY_COLS=TRAILING);

Sparse LONG example:

SELECT table FROM doc
TO TABLE(FORMAT=SPARSE, SPARSE_SHAPE=LONG, HEADER=ON);

Sparse WIDE example:

SELECT table FROM doc
TO TABLE(FORMAT=SPARSE, SPARSE_SHAPE=WIDE, HEADER=ON);

Determinism:

TO CSV / TO PARQUET

SELECT a.href, TEXT(a) FROM doc WHERE href IS NOT NULL TO CSV('links.csv');
SELECT * FROM doc TO PARQUET('nodes.parquet');

By default, exported column names are normalized to identifier-safe names (for example data-id -> data_id).

TO JSON / TO NDJSON

SELECT a.href, TEXT(a) FROM doc WHERE href IS NOT NULL TO JSON('links.json');
SELECT a.href, TEXT(a) FROM doc WHERE href IS NOT NULL TO NDJSON('links.ndjson');

Both also accept empty destination (TO JSON() / TO NDJSON()) to stream to stdout.

REPL Workflow

Useful commands:

csv mode writes rectangular query results directly to stdout. It does not render TO TABLE() results; for extracted HTML tables, use TO TABLE(EXPORT='file.csv').

Column-name modes:

Vim navigation mode:

Great iterative pattern:

  1. .load input
  2. Start with SELECT * ... LIMIT 5
  3. Add WHERE filters
  4. Add projections/functions
  5. Export with TO LIST/TO CSV/TO TABLE

Practical Advanced Use Cases

1) Review extraction

SELECT div.data-review-id, FLATTEN_TEXT(div) AS (review_text)
FROM doc
WHERE attributes.class = 'review'
  AND descendant.attributes.data-testid = 'review-text';

2) Navigation audit

SELECT a.href, TEXT(a)
FROM doc
WHERE ancestor.id = 'navbar' AND href IS NOT NULL
TO CSV('nav_links.csv');

3) Content block quality checks

SELECT section
FROM doc
WHERE attributes.class CONTAINS 'content'
  AND descendant.tag IN ('h1','h2','p');

4) Table extraction for analytics

SELECT table FROM doc WHERE id = 'report' TO TABLE(EXPORT='report.csv');

Troubleshooting

If you get parse errors:

If no rows return:

Self-Discovery Commands

Use built-in metadata queries while learning:

SHOW FUNCTIONS;
SHOW AXES;
SHOW OPERATORS;
DESCRIBE doc;
DESCRIBE language;