Skip to the content.

Chapter 7: Value Extraction with TEXT and ATTR

TL;DR

Extraction functions answer “which value do I take from this kept row?” They do not decide whether a row exists.

What are TEXT, DIRECT_TEXT, and ATTR?

These are stage-2 value extraction functions. TEXT(tag ...) returns text from a selected supplier node, DIRECT_TEXT(tag) returns immediate text children only, and ATTR(tag, name ...) returns an attribute value from a selected supplier node.

They matter because DOM extraction is not just row selection. You must define where each value comes from. MarkQL forces this explicitness so schemas stay understandable. Instead of implicit “best guess” extraction, each field expresses supplier constraints.

This may feel unfamiliar because there is a guard: certain extraction functions require a narrowing WHERE in the query. That guard is intentional. It prevents accidental whole-document extraction and pushes users toward explicit row scope.

Note: Supplier node selection is separate from row selection

When you call TEXT(h3) inside PROJECT(section), the query is not searching the whole document. It searches supplier nodes relative to the current row scope. If no supplier matches, the field returns NULL and the row still exists. This is one of the most important distinctions in MarkQL.

Rules

Scope

row R kept by outer WHERE
  field expression picks supplier S under R
  value = function(S)
if no S exists:
  value = NULL
  row remains in result

Listing 7-1: Deliberate failure (guard)

# EXPECT_FAIL: requires a WHERE clause
./build/markql --mode plain --color=disabled \
  --query "SELECT TEXT(section) FROM doc;" \
  --input docs/fixtures/basic.html

Observed error:

Error: TEXT()/INNER_HTML()/RAW_INNER_HTML() requires a WHERE clause

The guard reminds you to define row scope before extraction.

Listing 7-2: Correct TEXT extraction with narrowing

./build/markql --mode plain --color=disabled \
  --query "SELECT TEXT(section) FROM doc WHERE attributes.data-kind = 'flight' ORDER BY node_id;" \
  --input docs/fixtures/basic.html

Observed output (trimmed):

[
  {"text":"...1 stop...¥12,300..."},
  {"text":"...nonstop...¥8,500..."}
]

Listing 7-3: ATTR extraction

./build/markql --mode plain --color=disabled \
  --query "SELECT ATTR(a, href) FROM doc WHERE attributes.rel = 'nav' ORDER BY node_id;" \
  --input docs/fixtures/basic.html

Observed output:

[
  {"attr":"/home"},
  {"attr":"/about"}
]

Listing 7-4: DIRECT_TEXT behavior

./build/markql --mode plain --color=disabled \
  --query "SELECT DIRECT_TEXT(span) FROM doc WHERE attributes.class = 'stop' ORDER BY node_id;" \
  --input docs/fixtures/basic.html

Observed output:

[
  {"direct_text":"1 stop"},
  {"direct_text":"nonstop"}
]

Before/after diagrams

Before
  implicit extraction from unknown scope
After
  row R fixed -> supplier S selected -> value extracted

Common mistakes

Chapter takeaway

Reliable extraction comes from explicit supplier logic, not from hoping one selector fits every row variation.