Skip to the content.

Chapter 4: Sources and Loading

TL;DR

Source choice controls reproducibility. Use stable sources while developing queries, then switch inputs deliberately for production workflows.

What are MarkQL sources?

A MarkQL source is the input root that supplies the row stream. doc is the canonical parsed input source in CLI runs, but MarkQL also supports file/URL string sources, RAW(...) inline HTML, and PARSE(...) when you need to parse HTML strings into a source.

This matters because source choice affects reproducibility. A query that works on captured local HTML is reproducible for tests. A query that reads from network input may change over time. MarkQL supports both, but you should choose deliberately based on whether you are debugging, testing, or running production extraction.

This may feel unfamiliar if you normally tie scraping logic to browser state directly. In MarkQL, source and query are separate concerns. That separation is practical: you can freeze one HTML fixture and iterate on query semantics quickly.

Note: Source is where determinism starts

Teams often think determinism starts in query syntax. It starts earlier, at input. If the source changes every run, debugging semantic issues becomes noisy. MarkQL’s source system (--input, RAW, PARSE, stdin) is intentionally explicit so you can control that noise.

Rules

Scope

CLI input path
  --input file.html
      -> parsed DOM
      -> available as table: doc
RAW/PARSE
  query literal source
      -> parsed in-query
      -> row stream local to that source expression

Listing 4-1: File source via --input

./build/markql --mode plain --color=disabled \
  --query "SELECT section.node_id FROM doc WHERE tag='section' ORDER BY node_id;" \
  --input docs/fixtures/basic.html

Observed output:

[
  {"node_id":6},
  {"node_id":11},
  {"node_id":16}
]

Listing 4-2: Stdin source

printf '<div class="x">stdin</div>' | \
./build/markql --mode plain --color=disabled \
  --query "SELECT div FROM doc WHERE attributes.class = 'x';"

Observed output:

[
  {"node_id":2,"tag":"div","attributes":{"class":"x"},...}
]

Listing 4-3: Inline RAW(...)

./build/markql --mode plain --color=disabled \
  --query "SELECT div FROM RAW('<div class=\"x\">hello</div>');"

Observed output (trimmed):

[
  {"tag":"div","attributes":{"class":"x"},...}
]

Listing 4-4: Deliberate failure (TEXT guard still applies)

Even with explicit source, extraction guard rules remain.

# EXPECT_FAIL: requires a WHERE clause
./build/markql --mode plain --color=disabled \
  --query "SELECT TEXT(div) FROM RAW('<div>hello</div>');"

Observed error:

Error: TEXT()/INNER_HTML()/RAW_INNER_HTML() requires a WHERE clause

Fix:

./build/markql --mode plain --color=disabled \
  --query "SELECT TEXT(div) FROM RAW('<div class=\"x\">hello</div>') WHERE attributes.class = 'x';"

Observed output:

[
  {"text":"hello"}
]

Listing 4-5: PARSE(...) for multiple roots

./build/markql --mode plain --color=disabled \
  --query "SELECT div FROM PARSE('<div id=\"a\">one</div><div id=\"b\">two</div>') AS frag ORDER BY node_id;"

Observed output:

[
  {"attributes":{"id":"a"},...},
  {"attributes":{"id":"b"},...}
]

Compatibility note:

Before/after diagrams

Before
  query correctness depends on live page timing
After
  freeze source -> iterate query -> verify outputs

Common mistakes

Chapter takeaway

Good extraction starts before query syntax: choose input sources that make behavior repeatable.