MarkQL Tutorial
This is the canonical tutorial for MarkQL.
If you only read one top-level document in this repository, read this one.
What MarkQL Is
MarkQL is a SQL-style query language for HTML. It treats parsed DOM nodes as rows and lets you filter and project them using familiar query structure: SELECT ... FROM ... WHERE ....
MarkQL exists to solve a practical extraction problem: you usually do not fail at finding one value once; you fail at keeping extraction stable after page updates. MarkQL addresses that by making row selection and field extraction explicit instead of mixing them in ad-hoc traversal code.
This model can feel unusual at first if you are used to one-pass scraping scripts. That is normal. Once the row/field separation clicks, debugging gets much faster because each query has a clear failure point.
Note: The Core Mental Model
Note
MarkQL runs in two stages:
- Outer
WHEREfilters row candidates.- Field expressions compute output values for each kept row.
The most important consequence is:
- Outer
WHEREdecides whether a row exists in output.- Field predicates decide which node supplies a field value.
If a field has no matching supplier, that field becomes
NULL. The row remains unless outerWHEREexcludes it.
Rules To Remember
- Probe candidate rows first. Do not start with a large
PROJECT(...). - Use durable anchors (
data-*, structural patterns) before cosmetic class-only filters. - Add
EXISTS(...)guards when cards can be incomplete variants. - Treat repeated blocks explicitly with
FIRST_...andLAST_.... - Normalize output values in-query (
TRIM,REPLACE) before export. - Keep extraction functions (
TEXT,INNER_HTML,RAW_INNER_HTML) behind an outerWHEREclause.
Scope Diagram
doc
├─ node #A (candidate row) -- outer WHERE --> keep or drop
│ ├─ descendant d1
│ ├─ descendant d2
│ └─ descendant d3
└─ node #B (candidate row) -- outer WHERE --> keep or drop
For each kept row:
PROJECT(row) fields choose suppliers from [row + descendants]
Build and First Command
Build:
./build.sh
Listing 1: Verify the CLI and inspect rows
./build/markql \
--input docs/fixtures/basic.html \
--query "
SELECT section(node_id, tag)
FROM doc
WHERE attributes.class CONTAINS 'result'
LIMIT 5;
"
Observed output:
node_id tag
6 section
11 section
16 section
Rows: 3
Why this first listing matters: it confirms parsing works and shows that your row anchor (section with class containing result) is real before field extraction begins.
Query Shape
General structure:
SELECT <items>
FROM <source>
[WHERE <predicate>]
[ORDER BY <field> [ASC|DESC]]
[LIMIT n]
[TO LIST() | TO TABLE() | TO CSV('file.csv') | TO PARQUET('file.parquet') | TO JSON(['file.json']) | TO NDJSON(['file.ndjson'])];
Common sources:
FROM doc
FROM document
FROM 'local.html'
FROM 'https://example.com'
FROM RAW('<div>...</div>')
Alias Naming Convention (Recommended Style)
This is a readability convention, not a language rule.
FROM docbinds an implicit row alias nameddoc.FROM doc AS node_docrebinds the row alias tonode_doc;doc.*is no longer bound in that scope.- Use
node_<semantic>for DOM node rows. - Use
r_<semantic>for CTE/derived logical rows.
Examples:
SELECT self
FROM doc AS node_doc;
Compatibility note:
SELECT <from_alias>remains accepted for backward compatibility in query shapes where it already behaved as row-node return sugar.- Lint now warns for that ambiguous pattern and recommends
SELECT self.
WITH r_rows AS (
SELECT node_row.node_id AS row_id
FROM doc AS node_row
WHERE node_row.tag = 'tr'
)
SELECT r_rows.row_id
FROM r_rows;
Short table-trimming example:
SELECT table
FROM doc
TO TABLE(TRIM_EMPTY_ROWS=ON, TRIM_EMPTY_COLS=TRAILING);
This keeps the table rectangular, drops fully empty padding rows, and trims only right-edge empty columns.
Running Scripts
You can run multiple statements from a .sql file:
./build/markql \
--query-file ./queries/report.sql \
--input docs/fixtures/basic.html
Script rules:
- Statements are separated by
;. - A trailing semicolon is allowed.
- Empty statements are ignored.
- By default execution stops on the first failing statement.
- Use
--continue-on-errorto execute remaining statements. - Use
--quietto suppress== stmt i/N ==delimiters.
Comments are standard SQL:
-- line comment/* block comment */
Example script:
-- summarize available functions
SHOW FUNCTIONS;
/* extract cards */
SELECT section.node_id, section.tag
FROM doc
WHERE attributes.class CONTAINS 'result';
REPL comment-only input is a no-op:
markql> -- just a comment
markql> /* block comment */
markql>
DOM Explorer
Use explore mode when you want to inspect DOM structure before writing queries:
./build/markql explore docs/fixtures/basic.html
From REPL:
markql> .explore
markql> .explore doc
markql> .explore https://example.com/page.html
Keybindings:
Up/Down: move selection in the treeRightorEnter: expand selected nodeLeft: collapse selected node/: start search on node content (default mode: exact, case-insensitive)m: toggle search mode between exact and fuzzyn/N: jump next/previous search match (auto-expands ancestor nodes to reveal hit)g/G: generate MarkQL suggestion for current selected nodey/Y: copy suggested MarkQL to clipboardC: collapse all expanded branches and clear search statej/k: scroll theInner HTML Headpane down/up+/-: zoom in/out theInner HTML HeadpaneEnter: keep current search results and leave search modeEsc: cancel search mode and clear queryq: quit
Search preview behavior:
- Exact mode uses contiguous case-insensitive matching.
- Fuzzy mode uses case-insensitive ordered-character matching.
- Search scopes are evaluated in this priority order:
- self attributes (
id,class,data-*, key or value) - self tag name and self direct text
- descendant content (
inner_html)
- self attributes (
- Ranking is applied in this order:
- source priority (scopes above)
- match quality (
whole-word>word-start>other) - node depth (deeper/closer node first)
- earlier match position
- lower
node_id(tie-break)
- When a match is selected,
Inner HTML Headauto-focuses around the match and color-highlights the matched term. - Use
j/kto move around the nearby context after auto-focus. - Search input accepts UTF-8 text (for example Japanese and Khmer) and symbols like
-/_. - Search execution is debounced while typing; auto-search starts at 2+ characters.
- Press
Enterto run immediately (including 1-character queries).
Session behavior:
- Explorer restores your last expanded nodes, selected node, search query, and zoom for the same input during the current MarkQL session.
- Explorer also restores the current search mode (exact or fuzzy) for that input.
Layout:
- Left pane: collapsed tree (
node_id tagwith compact attribute hints) - Left pane bottom:
Suggestbox with generated full MarkQL statement (manual trigger viag/G) - Right pane: boxed sections for
Node,Inner HTML Head(formatted preview), andAttributes
From Row Probe To Real Extraction
Listing 2: A minimal PROJECT extraction
./build/markql \
--input docs/fixtures/basic.html \
--query "
SELECT section.node_id,
PROJECT(section) AS (
city: TEXT(h3),
stops: TEXT(span WHERE DIRECT_TEXT(span) LIKE '%stop%'),
price_text: TEXT(span WHERE attributes.role = 'text')
)
FROM doc
WHERE attributes.class CONTAINS 'result'
AND attributes.data-kind = 'flight'
ORDER BY node_id;
"
Observed output:
node_id city stops price_text
6 Tokyo 1 stop ¥12,300
11 Osaka nonstop ¥8,500
Rows: 2
What happened:
- Outer
WHEREkept only flight cards. - For each row,
city,stops, andprice_textwere computed independently. DIRECT_TEXT(span) LIKE '%stop%'selected the stop label supplier node without relying on brittle positional assumptions.
Deliberate Failure: Field Extraction Guard
Listing 3: Failing query
./build/markql \
--input docs/fixtures/basic.html \
--query "
SELECT TEXT(div)
FROM doc;
"
Observed error:
Error: TEXT()/INNER_HTML()/RAW_INNER_HTML() requires a WHERE clause
Why this fails: this branch requires extraction functions to run under a query-level WHERE guard. This prevents accidental whole-document text extraction.
Listing 4: Corrected query
./build/markql \
--input docs/fixtures/basic.html \
--query "
SELECT TEXT(p)
FROM doc
WHERE attributes.class CONTAINS 'summary';
"
Observed output:
text
Near station
Rows: 1
The fix is not syntax decoration. It forces explicit extraction scope.
Extraction function contract summary:
TEXT()/INNER_HTML()/RAW_INNER_HTML()require an outerWHERE.- The outer
WHEREmust include at least one non-tag self predicate. INNER_HTML(tag)defaults to depth1.INNER_HTML(tag, MAX_DEPTH)uses each row’smax_depthautomatically.
Deliberate Failure: Row Filtering vs Field Selection
Listing 5: Naive row filter (keeps incomplete variant)
./build/markql \
--input docs/fixtures/basic.html \
--query "
SELECT section.node_id,
PROJECT(section) AS (
name: TEXT(h3),
price_text: TEXT(span WHERE attributes.role = 'text')
)
FROM doc
WHERE attributes.class CONTAINS 'result'
ORDER BY node_id;
"
Observed output:
node_id name price_text
6 Tokyo ¥12,300
11 Osaka ¥8,500
16 Kyoto Stay NULL
Rows: 3
Why this happens: outer WHERE admitted both flight and hotel cards. The hotel row has no price supplier node, so only that field becomes NULL.
Listing 6: Corrected with row-quality guard
./build/markql \
--input docs/fixtures/basic.html \
--query "
SELECT section.node_id
FROM doc
WHERE tag = 'section'
AND EXISTS(descendant WHERE attributes.role = 'text')
ORDER BY node_id;
"
Observed output:
node_id
6
11
Rows: 2
This is the key pattern: use EXISTS(...) in outer WHERE to keep rows that have required structural evidence.
FLATTEN vs PROJECT
FLATTEN is fast for prototyping regular blocks. PROJECT is better when schema stability matters.
Listing 7: FLATTEN on semi-regular items
./build/markql \
--input docs/fixtures/products.html \
--query "
SELECT li.node_id,
FLATTEN(li) AS (name, summary, lang1, lang2)
FROM doc
WHERE attributes.class = 'item'
ORDER BY node_id;
"
Observed output:
node_id name summary lang1 lang2
3 Alpha Fast and light EN JP
8 Beta EN NULL NULL
11 Gamma Budget EN TH
Rows: 3
Interpretation:
- Fast and compact for exploration.
- On irregular rows (
Beta), one positional slot shifts and semantic meaning drifts. - This is exactly when you should switch to explicit
PROJECT(...).
Listing 8: Stable mapping with PROJECT
./build/markql \
--input docs/fixtures/products.html \
--query "
SELECT li.node_id,
PROJECT(li) AS (
title: TEXT(h2),
summary: TEXT(p),
language_primary: FIRST_TEXT(span),
language_secondary: LAST_TEXT(span)
)
FROM doc
WHERE attributes.class = 'item'
ORDER BY node_id;
"
Observed output:
node_id title summary language_primary language_secondary
3 Alpha Fast and light EN JP
8 Beta NULL EN EN
11 Gamma Budget EN TH
Rows: 3
Notice that missing values are explicit (NULL) rather than accidental column drift.
String Operations In Real Queries
MarkQL supports SQL-style string operators and functions in SELECT, outer WHERE, and PROJECT fields.
Examples include:
LIKE(%,_)CONCATSUBSTRING/SUBSTRLENGTH/CHAR_LENGTHPOSITION/LOCATEREPLACELOWER/UPPERLTRIM/RTRIM/TRIMDIRECT_TEXTCASE WHEN ... THEN ... ELSE ... END
Listing 9: Text normalization for numeric export
./build/markql \
--input docs/fixtures/basic.html \
--query "
SELECT section.node_id,
PROJECT(section) AS (
price_text: TEXT(span WHERE attributes.role = 'text'),
price_num: TRIM(REPLACE(REPLACE(TEXT(span WHERE attributes.role = 'text'), '¥', ''), ',', ''))
)
FROM doc
WHERE attributes.data-kind = 'flight'
ORDER BY node_id;
"
Observed output:
node_id price_text price_num
6 ¥12,300 12300
11 ¥8,500 8500
Rows: 2
Listing 10: CASE expression
./build/markql \
--input docs/fixtures/basic.html \
--query "
SELECT CASE
WHEN tag = 'section' THEN 'card'
ELSE 'other'
END
FROM doc
WHERE tag = 'section'
LIMIT 2;
"
Observed output:
case
card
card
Rows: 2
JSON and NDJSON Export
Listing 11: JSON array
./build/markql \
--input docs/fixtures/basic.html \
--query "
SELECT a(href, rel)
FROM doc
WHERE tag = 'a'
ORDER BY node_id
TO JSON();
"
Observed output:
[{"href":"/home","rel":"nav"},{"href":"/about","rel":"nav"}]
Listing 12: NDJSON (one object per line)
./build/markql \
--input docs/fixtures/basic.html \
--query "
SELECT a(href, rel)
FROM doc
WHERE tag = 'a'
ORDER BY node_id
TO NDJSON();
"
Observed output:
{"href":"/home","rel":"nav"}
{"href":"/about","rel":"nav"}
RAW Source Example
Listing 13: Querying inline HTML with RAW(...)
./build/markql \
--query "
SELECT a(href)
FROM RAW('<div><a href=\"/x\">X</a><a href=\"/y\">Y</a></div>')
WHERE tag = 'a'
ORDER BY node_id
TO JSON();
"
Observed output:
[{"href":"/x"},{"href":"/y"}]
Practical Troubleshooting Checklist
When a query fails or returns weak output:
- Run a row-only probe first (
node_id,tag,max_depth). - Verify the outer
WHEREselects the right entity type. - Add
EXISTS(...)for mandatory descendants. - Evaluate each
PROJECTfield independently. - Normalize strings only after supplier matching is correct.
- Export to JSON/NDJSON only after schema is stable.
Final Takeaway
MarkQL is easiest when you treat extraction as two clear decisions:
- Keep the right rows.
- Compute each column from explicit suppliers.
Once you enforce that discipline, query maintenance becomes a routine edit instead of a full scraper rewrite.
Next
- Chapter-based path:
book/SUMMARY.md - Case studies:
case-studies/README.md