
Chapter 4: Sources and Loading

TL;DR

Source choice controls reproducibility. Use stable sources while developing queries, then switch inputs deliberately for production workflows.

What are MarkQL sources?

A MarkQL source is the input root that supplies the row stream. In CLI runs, doc is the canonical parsed input source. MarkQL also supports file and URL string sources, RAW(...) for inline HTML, and PARSE(...) for parsing HTML strings into a source.

This matters because source choice affects reproducibility. A query that runs on captured local HTML returns the same rows every time, which makes it suitable for tests. A query that reads from network input may return different results over time as the page changes. MarkQL supports both, but you should choose deliberately based on whether you are debugging, testing, or running production extraction.

This may feel unfamiliar if you normally tie scraping logic to browser state directly. In MarkQL, source and query are separate concerns. That separation is practical: you can freeze one HTML fixture and iterate on query semantics quickly.
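One way to keep a frozen fixture honest is to record a checksum beside it and verify that checksum before iterating on queries. The sketch below is a workflow assumption, not part of MarkQL: freeze_fixture and check_fixture are hypothetical helper names, and the only tool involved is the POSIX cksum utility.

```shell
#!/bin/sh
# Hypothetical helpers (not part of MarkQL) for pinning an HTML fixture.

# Record the fixture's checksum in a sidecar file: <fixture>.cksum
freeze_fixture() {
  cksum < "$1" > "$1.cksum"
}

# Return 0 if the fixture still matches its recorded checksum,
# 1 if its contents changed, 2 if no checksum was ever recorded.
check_fixture() {
  [ -f "$1.cksum" ] || return 2
  [ "$(cksum < "$1")" = "$(cat "$1.cksum")" ]
}
```

Calling freeze_fixture once on a captured page, then check_fixture before each query run, would flag an accidental edit to the fixture before it surfaces as a confusing change in query output.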

Note: Source is where determinism starts

Teams often think determinism starts in query syntax. It starts earlier, at input. If the source changes every run, debugging semantic issues becomes noisy. MarkQL’s source system (--input, RAW, PARSE, stdin) is intentionally explicit so you can control that noise.

Rules

Scope

CLI input path
  --input file.html
      -> parsed DOM
      -> available as table: doc
Artifact path
  --input file.mqd
      -> load parsed DOM snapshot
      -> available as table: doc

  --query-file file.mqp
      -> load prepared query
      -> execute against html/stdin/url/mqd input

  --query-file file.mql.j2 --render j2 --vars file.toml
      -> render plain MarkQL text first
      -> then lint/execute that rendered query
RAW/PARSE
  query literal source
      -> parsed in-query
      -> row stream local to that source expression

Listing 4-1: File source via --input

./build/markql --mode plain --color=disabled \
  --query "SELECT section.node_id FROM doc WHERE tag='section' ORDER BY node_id;" \
  --input docs/fixtures/basic.html

Observed output:

[
  {"node_id":6},
  {"node_id":11},
  {"node_id":16}
]

Listing 4-2: Stdin source

printf '<div class="x">stdin</div>' | \
./build/markql --mode plain --color=disabled \
  --query "SELECT div FROM doc WHERE attributes.class = 'x';"

Observed output (trimmed):

[
  {"node_id":2,"tag":"div","attributes":{"class":"x"},...}
]
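Piped input is gone once the command finishes, so it can help to save a copy of the bytes as they pass through. This is a minimal sketch, assuming a POSIX shell: capture_and_query is a hypothetical helper, MARKQL is an assumed variable pointing at the binary, and the markql flags are the ones used in Listing 4-2.

```shell
#!/bin/sh
# Hypothetical helper: keep a copy of stdin while querying it.
# MARKQL is an assumed variable; override it if the binary lives elsewhere.
MARKQL="${MARKQL:-./build/markql}"

capture_and_query() {
  # $1 = file that receives a copy of the piped HTML
  # $2 = MarkQL query text
  tee "$1" | "$MARKQL" --mode plain --color=disabled --query "$2"
}
```

The saved copy can later be passed with --input, as in Listing 4-1, so the same bytes can be queried again without the original pipe.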

Listing 4-3: Inline RAW(...)

./build/markql --mode plain --color=disabled \
  --query "SELECT div FROM RAW('<div class=\"x\">hello</div>');"

Observed output (trimmed):

[
  {"tag":"div","attributes":{"class":"x"},...}
]

Listing 4-4: Deliberate failure (TEXT guard still applies)

Even with an explicit source, the extraction guard rules still apply.

# EXPECT_FAIL: requires a WHERE clause
./build/markql --mode plain --color=disabled \
  --query "SELECT TEXT(div) FROM RAW('<div>hello</div>');"

Observed error:

Error: TEXT()/INNER_HTML()/RAW_INNER_HTML() requires a WHERE clause

Fix:

./build/markql --mode plain --color=disabled \
  --query "SELECT TEXT(div) FROM RAW('<div class=\"x\">hello</div>') WHERE attributes.class = 'x';"

Observed output:

[
  {"text":"hello"}
]

Listing 4-5: PARSE(...) for multiple roots

./build/markql --mode plain --color=disabled \
  --query "SELECT div FROM PARSE('<div id=\"a\">one</div><div id=\"b\">two</div>') AS frag ORDER BY node_id;"

Observed output (trimmed):

[
  {"attributes":{"id":"a"},...},
  {"attributes":{"id":"b"},...}
]

Compatibility note:

Template query files via --query-file

--query-file can also load a templated query file when you opt in explicitly:

./build/markql \
  --query-file tests/fixtures/render/generic_query.mql.j2 \
  --render j2 \
  --vars tests/fixtures/render/generic_query.toml \
  --rendered-out /tmp/generic_query.mql \
  --lint

This keeps the boundary explicit:

Recommended file naming:

Versioned Artifacts

Experimental status:

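MarkQL’s artifact MVP adds two explicit cacheable boundaries: a parsed document snapshot (.mqd, written with --write-mqd) and a prepared query (.mqp, written with --write-mqp).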

This keeps the semantic boundary stable:

Create and inspect them from the CLI:

./build/markql --input docs/fixtures/basic.html --write-mqd /tmp/basic.mqd
./build/markql --query "SELECT a.href FROM doc WHERE href IS NOT NULL" --write-mqp /tmp/links.mqp
./build/markql --artifact-info /tmp/basic.mqd

Run a prepared query against a prepared document:

./build/markql --query-file /tmp/links.mqp --input /tmp/basic.mqd
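Artifacts pay off when work is repeated, so a natural pattern is to run one prepared query across several prepared documents. The following is only a sketch: run_prepared and MARKQL are hypothetical names, and it relies solely on the --query-file and --input flags shown above.

```shell
#!/bin/sh
# Hypothetical helper: run one prepared query over many prepared documents.
# MARKQL is an assumed variable; override it if the binary lives elsewhere.
MARKQL="${MARKQL:-./build/markql}"

run_prepared() {
  # $1 = prepared query (.mqp); remaining arguments = prepared documents (.mqd)
  mqp=$1
  shift
  for mqd in "$@"; do
    # Label each result so outputs from different documents stay distinguishable.
    printf '== %s\n' "$mqd"
    "$MARKQL" --query-file "$mqp" --input "$mqd" || return 1
  done
}
```

For example, run_prepared /tmp/links.mqp /tmp/basic.mqd would behave like the single command above, with a header line naming the document. The loop stops at the first failing run, so an unreadable artifact is noticed immediately.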

MVP limits:

Security contract:

Prepared-query semantic boundary:

Build note:

Benchmark methodology and current result:

Before/after diagrams

Before
  query correctness depends on live page timing
After
  freeze source -> iterate query -> verify outputs
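The "verify outputs" step can be as simple as comparing a run against a saved expected file. Here is a hedged sketch of that idea: run_query, verify_golden, and MARKQL are hypothetical names introduced for illustration, and the markql flags are the ones used in the listings in this chapter.

```shell
#!/bin/sh
# Hypothetical helpers: compare a query's output on a frozen fixture
# with a previously saved "golden" output file.
# MARKQL is an assumed variable; override it if the binary lives elsewhere.
MARKQL="${MARKQL:-./build/markql}"

run_query() {
  # $1 = MarkQL query text, $2 = frozen input file
  "$MARKQL" --mode plain --color=disabled --query "$1" --input "$2"
}

verify_golden() {
  # $1 = MarkQL query text, $2 = frozen input file, $3 = golden output file
  # Returns 0 on a match, 1 on a mismatch, 2 if the query or the file read fails.
  actual=$(run_query "$1" "$2") || return 2
  expected=$(cat "$3") || return 2
  [ "$actual" = "$expected" ]
}
```

A golden file could be produced once by redirecting run_query output to a file after reviewing it by hand. From then on, a mismatch points at a query change, because the source is frozen.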

Common mistakes

Chapter takeaway

Good extraction starts before query syntax: choose input sources that make behavior repeatable, and freeze parsed inputs or prepared queries when repeated-work cost matters more than one-off setup cost.