Chapter 11: Output and Export
TL;DR
MarkQL result semantics stay the same across sinks; only serialization format changes. Pick the sink that matches your downstream workflow.
What are output sinks in MarkQL?
Output sinks are query targets that serialize result rows: TO LIST(), TO CSV(...), TO JSON(...), and TO NDJSON(...). They are not just convenience features; they define interoperability boundaries with downstream tools.
This matters because extraction value is realized downstream. If output shape and format are unstable, downstream systems become fragile. MarkQL keeps sink syntax explicit in the query so result shaping and export intent are reviewable together.
This may feel unfamiliar if you are used to handling output entirely in host-language code. In MarkQL, sink intent can be encoded at the query level, which reduces glue code and keeps extraction semantics close to serialization semantics.
Note: Sink choice is part of contract design
LIST is scalar-focused. CSV is tabular and easy for spreadsheets/SQL ingestion. JSON is an array-oriented payload. NDJSON is stream-friendly and append-friendly. Choosing a sink early clarifies column expectations.
Rules
- Use TO LIST() only for one projected column.
- Use TO CSV for human-readable table exports.
- Use TO JSON for array payload integration.
- Use TO NDJSON for streaming pipelines.
- Verify sink constraints with small test exports first.
Tables
TO TABLE() is rectangular by default. It preserves extracted rows/cells as-is unless you opt into trimming or sparse output options.
Defaults:
- TRIM_EMPTY_ROWS=OFF
- TRIM_EMPTY_COLS=OFF
- EMPTY_IS=BLANK_OR_NULL
- STOP_AFTER_EMPTY_ROWS=0 (disabled)
- FORMAT=RECT
- SPARSE_SHAPE=LONG (used only when FORMAT=SPARSE)
- HEADER_NORMALIZE=ON when HEADER=ON (ignored otherwise)
Tiny before/after (same fixture shape as tests/fixtures/tables/trailing_empty_rows_and_cols.html):
SELECT table FROM doc TO TABLE();
Keeps trailing padding rows and trailing empty columns.
SELECT table FROM doc
TO TABLE(TRIM_EMPTY_ROWS=ON, TRIM_EMPTY_COLS=TRAILING);
Drops padding rows and trims only right-edge empty columns.
Sparse formats:
SELECT table FROM doc
TO TABLE(FORMAT=SPARSE, SPARSE_SHAPE=LONG, TRIM_EMPTY_ROWS=ON, TRIM_EMPTY_COLS=TRAILING, HEADER=ON);
Returns one record per non-empty cell (row_index, col_index, optional header, value). Use this for pipelines and append-style processing.
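A downstream consumer may need to turn long-form sparse records back into a dense grid. The Python sketch below shows one way to do that. It assumes each record is a dict with zero-based integer row_index and col_index keys plus a value key; the exact key names, index base, and wire format of MarkQL's sparse output are assumptions here, and long_to_grid is a hypothetical helper, not part of MarkQL.

```python
def long_to_grid(records, fill=""):
    """Rebuild a rectangular grid from long-form sparse cell records.

    Assumes zero-based integer "row_index" and "col_index" keys and a
    "value" key on every record (assumed names, not confirmed output).
    """
    if not records:
        return []
    n_rows = max(r["row_index"] for r in records) + 1
    n_cols = max(r["col_index"] for r in records) + 1
    # Start from a fully padded grid, then place each non-empty cell.
    grid = [[fill] * n_cols for _ in range(n_rows)]
    for r in records:
        grid[r["row_index"]][r["col_index"]] = r["value"]
    return grid


# Hypothetical sample records, not taken from a real MarkQL run.
sample = [
    {"row_index": 0, "col_index": 0, "value": "Alpha"},
    {"row_index": 0, "col_index": 2, "value": "9.99"},
    {"row_index": 1, "col_index": 1, "value": "Beta"},
]
grid = long_to_grid(sample)
```

Because every record carries its own coordinates, records can be appended or processed in any order without changing the rebuilt grid.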
SELECT table FROM doc
TO TABLE(FORMAT=SPARSE, SPARSE_SHAPE=WIDE, TRIM_EMPTY_ROWS=ON, TRIM_EMPTY_COLS=TRAILING, HEADER=ON);
Returns one object per data row with only non-empty keys. Use this for per-row object payloads.
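Since wide sparse objects omit empty keys, consumer code should not index keys directly. The Python sketch below projects such objects onto a fixed column order with a default for missing cells. The sample records and the wide_to_rows helper are hypothetical, and the key names stand in for whatever headers a real export produces.

```python
def wide_to_rows(records, columns, fill=""):
    """Project sparse per-row objects onto a fixed column order.

    Missing keys (cells that were empty at export time) become `fill`.
    """
    return [[rec.get(col, fill) for col in columns] for rec in records]


# Hypothetical sample; keys stand in for table headers.
sample = [
    {"name": "Alpha", "price": "9.99"},
    {"name": "Beta"},  # "price" was empty, so the key is absent
]
table = wide_to_rows(sample, columns=["name", "price"])
```

Passing the column list explicitly keeps the consumer's shape expectations visible instead of inferring them from whichever keys happen to appear first.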
Determinism and compatibility:
- With no new options, table output stays backward compatible.
- For a fixed DOM snapshot and fixed options, output is deterministic.
Scope
query result rows
-> sink serializer
-> file or stdout
same row semantics, different wire format
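The idea of "same row semantics, different wire format" can be illustrated outside MarkQL. The Python sketch below serializes one list of rows as a JSON array, NDJSON, and CSV using only the standard library, then parses each back. It is an illustration of the concept, not MarkQL's serializer; the to_json, to_ndjson, and to_csv helpers are hypothetical, and the sample values are borrowed from the product fixture output shown in this chapter.

```python
import csv
import io
import json

rows = [
    {"node_id": "3", "name": "Alpha"},
    {"node_id": "8", "name": "Beta"},
]


def to_json(rows):
    # One array holding every row.
    return json.dumps(rows, separators=(",", ":"))


def to_ndjson(rows):
    # One JSON object per line, newline-terminated.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in rows)


def to_csv(rows):
    # Header row followed by one line per row.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Parse each wire format back into a list of dicts.
from_json = json.loads(to_json(rows))
from_ndjson = [json.loads(line) for line in to_ndjson(rows).splitlines()]
from_csv = list(csv.DictReader(io.StringIO(to_csv(rows))))
```

All three parsed results compare equal to the original rows here because every value is a string; CSV carries no type information, so typed values would need an explicit conversion step on the consumer side.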
Listing 11-1: JSON array to stdout
./build/markql --mode plain --color=disabled \
--query "SELECT li.node_id, PROJECT(li) AS (name: TEXT(h2)) FROM doc WHERE tag = 'li' ORDER BY node_id TO JSON();" \
--input docs/fixtures/products.html
Observed output:
[{"node_id":"3","name":"Alpha"},{"node_id":"8","name":"Beta"},{"node_id":"11","name":"Gamma"}]
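On the consuming side, a JSON array is parsed in a single step. The Python sketch below loads the observed output from Listing 11-1, pasted in as a string for the example, and builds a lookup by node_id. In practice the payload would come from stdout or a file; how it is captured is left out here.

```python
import json

# Observed output from Listing 11-1, embedded as a literal for the example.
payload = (
    '[{"node_id":"3","name":"Alpha"},'
    '{"node_id":"8","name":"Beta"},'
    '{"node_id":"11","name":"Gamma"}]'
)

rows = json.loads(payload)  # the whole array must be present before parsing
names_by_id = {row["node_id"]: row["name"] for row in rows}
```

Note that node_id arrives as a string in this output, so lookups use "3" rather than 3.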
Listing 11-2: NDJSON to stdout
./build/markql --mode plain --color=disabled \
--query "SELECT li.node_id, PROJECT(li) AS (name: TEXT(h2), note: COALESCE(TEXT(p), 'n/a')) FROM doc WHERE tag = 'li' ORDER BY node_id TO NDJSON();" \
--input docs/fixtures/products.html
Observed output:
{"node_id":"3","name":"Alpha","note":"Fast and light"}
{"node_id":"8","name":"Beta","note":"n/a"}
{"node_id":"11","name":"Gamma","note":"Budget"}
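NDJSON can be consumed one record at a time, which is what makes it suitable for streaming. The Python sketch below reads the observed output of Listing 11-2 line by line; io.StringIO stands in for a pipe or file handle, and iter_ndjson is a hypothetical helper.

```python
import io
import json


def iter_ndjson(stream):
    """Yield one parsed record per non-blank line of the stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Observed output from Listing 11-2, embedded as a literal for the example.
ndjson_text = (
    '{"node_id":"3","name":"Alpha","note":"Fast and light"}\n'
    '{"node_id":"8","name":"Beta","note":"n/a"}\n'
    '{"node_id":"11","name":"Gamma","note":"Budget"}\n'
)

records = list(iter_ndjson(io.StringIO(ndjson_text)))
# Rows where the COALESCE fallback was used in the query.
missing_notes = [rec["name"] for rec in records if rec["note"] == "n/a"]
```

Because each line is a complete JSON document, a consumer can start processing before the producer has finished, and new records can be appended to an existing file without rewriting it.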
Listing 11-3: CSV to file
./build/markql --mode plain --color=disabled \
--query "SELECT li.node_id, PROJECT(li) AS (name: TEXT(h2), note: COALESCE(TEXT(p), 'n/a')) FROM doc WHERE tag = 'li' ORDER BY node_id TO CSV('/tmp/markql_products.csv');" \
--input docs/fixtures/products.html
Observed file /tmp/markql_products.csv:
node_id,name,note
3,Alpha,Fast and light
8,Beta,n/a
11,Gamma,Budget
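The exported CSV can be read back with Python's standard csv module. The sketch below parses the observed file contents from Listing 11-3, embedded as a string so the example is self-contained; to read the real file, open /tmp/markql_products.csv with newline="" and pass the handle to csv.DictReader instead.

```python
import csv
import io

# Observed contents of the CSV from Listing 11-3, embedded for the example.
csv_text = (
    "node_id,name,note\n"
    "3,Alpha,Fast and light\n"
    "8,Beta,n/a\n"
    "11,Gamma,Budget\n"
)

# DictReader uses the header row as keys; every value is read as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
node_ids = [row["node_id"] for row in rows]
```

The parsed rows carry the same keys and values as the NDJSON records in Listing 11-2, which is the practical meaning of unchanged result semantics across sinks.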
Listing 11-4: Deliberate failure (TO LIST shape)
# EXPECT_FAIL: TO LIST() requires a single projected column
./build/markql --mode plain --color=disabled \
--query "SELECT a.href, a.tag FROM doc WHERE href IS NOT NULL TO LIST();" \
--input docs/fixtures/basic.html
Observed error:
Error: TO LIST() requires a single projected column
Fix: use one projected value for LIST, or switch to a multi-column sink.
Before/after diagrams
Before
extract -> custom serializer script
After
extract + sink in one query contract
Common mistakes
- Choosing TO LIST() for multi-column output.
  Fix: use CSV, JSON, or NDJSON for table-shaped results.
- Deferring sink choices until late pipeline stages.
  Fix: declare sink intent early so shape assumptions stay explicit.
Chapter takeaway
Output is part of the extraction contract, not an afterthought.