Chapter 7: Value Extraction with TEXT and ATTR
TL;DR
Extraction functions answer “which value do I take from this kept row?” They do not decide whether a row exists.
What are TEXT, DIRECT_TEXT, and ATTR?
These are stage-2 value extraction functions. TEXT(tag ...) returns text from a selected supplier node, DIRECT_TEXT(tag) returns immediate text children only, and ATTR(tag, name ...) returns an attribute value from a selected supplier node.
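The difference between the three functions can be modelled outside MarkQL. The sketch below uses Python's standard xml.etree.ElementTree on a made-up fragment (not taken from docs/fixtures/basic.html). The helper names descendant_text, direct_text, and attr are hypothetical stand-ins for TEXT, DIRECT_TEXT, and ATTR. It illustrates the concepts only and makes no claim about how MarkQL implements them or normalizes whitespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration; not part of the chapter's fixture.
FRAGMENT = (
    '<section data-kind="flight">'
    'Tokyo <span class="stop">1 stop</span> direct'
    '</section>'
)

def descendant_text(node):
    """Model of TEXT: all text under the node, nested elements included."""
    return "".join(node.itertext())

def direct_text(node):
    """Model of DIRECT_TEXT: only text that is an immediate child of the node."""
    parts = [node.text or ""]
    # Text that follows each child element still belongs to the parent node.
    parts.extend(child.tail or "" for child in node)
    return "".join(parts)

def attr(node, name):
    """Model of ATTR: the attribute value, or None when the attribute is absent."""
    return node.get(name)

section = ET.fromstring(FRAGMENT)
print(repr(descendant_text(section)))  # includes the nested span text
print(repr(direct_text(section)))      # skips the nested span text
print(repr(attr(section, "data-kind")))
```

In this model the nested span text ("1 stop") appears in the descendant text but not in the direct text, which is the "pollution" that DIRECT_TEXT is meant to avoid.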
They matter because DOM extraction is not just row selection. You must define where each value comes from. MarkQL forces this explicitness so schemas stay understandable. Instead of implicit “best guess” extraction, each field expresses supplier constraints.
This may feel unfamiliar because there is a guard: certain extraction functions (TEXT, INNER_HTML, and RAW_INNER_HTML, according to the error shown in Listing 7-1) require a narrowing WHERE in the query. That guard is intentional. It prevents accidental whole-document extraction and pushes users toward explicit row scope.
Note: Supplier node selection is separate from row selection
When you call TEXT(h3) inside PROJECT(section), the query is not searching the whole document. It searches supplier nodes relative to the current row scope. If no supplier matches, the field returns NULL and the row still exists. This is one of the most important distinctions in MarkQL.
Rules
- Use TEXT when you want descendant text from supplier nodes.
- Use DIRECT_TEXT when descendant text pollution is a risk.
- Use ATTR for stable machine-readable values.
- Expect NULL when no supplier matches.
- Use COALESCE for optional fields.
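The last two rules work together: a missing supplier yields NULL, and COALESCE replaces that NULL with a fallback. The Python sketch below assumes MarkQL's COALESCE follows the usual SQL meaning (first non-NULL argument). That is an assumption, as are the helper name coalesce, the sample rows, and the "(untitled)" fallback. None stands in for NULL.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None

# Hypothetical extracted fields for two kept rows; None stands in for NULL.
rows = [
    {"title": "Tokyo", "stops": "1 stop"},
    {"title": None, "stops": "nonstop"},  # no title supplier matched
]

# Both rows survive; only the missing value is replaced.
labels = [coalesce(row["title"], "(untitled)") for row in rows]
print(labels)
```

Note that an empty string is not None, so in this model it would be kept rather than replaced by the fallback.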
Scope
row R kept by outer WHERE
field expression picks supplier S under R
value = function(S)
if no S exists:
  value = NULL
  row remains in result
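The scope rules above can be sketched as a two-stage loop. This is a minimal Python model, not MarkQL's implementation. The markup, the extract helper, and the choice of taking the first matching supplier are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration; not the chapter's fixture file.
HTML = """<doc>
  <section data-kind="flight"><h3>Tokyo</h3><span>1 stop</span></section>
  <section data-kind="flight"><span>nonstop</span></section>
  <section data-kind="hotel"><h3>Osaka</h3></section>
</doc>"""

def extract(rows, supplier_tag):
    """For each kept row, take text from a supplier under that row, else None."""
    values = []
    for row in rows:
        # Supplier lookup is relative to the row, never the whole document.
        supplier = row.find(f".//{supplier_tag}")
        if supplier is None:
            values.append(None)  # field is NULL, but the row is still emitted
        else:
            values.append("".join(supplier.itertext()))
    return values

doc = ET.fromstring(HTML)

# Stage 1: row selection (the role of the outer WHERE).
rows = [s for s in doc.iter("section") if s.get("data-kind") == "flight"]

# Stage 2: value extraction per kept row.
titles = extract(rows, "h3")
print(len(rows), titles)
```

The second flight section has no h3, so its title is None while the row count stays at two. The hotel section's h3 is never considered, because it lies outside every kept row.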
Listing 7-1: Deliberate failure (guard)
# EXPECT_FAIL: requires a WHERE clause
./build/markql --mode plain --color=disabled \
--query "SELECT TEXT(section) FROM doc;" \
--input docs/fixtures/basic.html
Observed error:
Error: TEXT()/INNER_HTML()/RAW_INNER_HTML() requires a WHERE clause
The guard reminds you to define row scope before extraction.
Listing 7-2: Correct TEXT extraction with narrowing
./build/markql --mode plain --color=disabled \
--query "SELECT TEXT(section) FROM doc WHERE attributes.data-kind = 'flight' ORDER BY node_id;" \
--input docs/fixtures/basic.html
Observed output (trimmed):
[
{"text":"...1 stop...¥12,300..."},
{"text":"...nonstop...¥8,500..."}
]
Listing 7-3: ATTR extraction
./build/markql --mode plain --color=disabled \
--query "SELECT ATTR(a, href) FROM doc WHERE attributes.rel = 'nav' ORDER BY node_id;" \
--input docs/fixtures/basic.html
Observed output:
[
{"attr":"/home"},
{"attr":"/about"}
]
Listing 7-4: DIRECT_TEXT behavior
./build/markql --mode plain --color=disabled \
--query "SELECT DIRECT_TEXT(span) FROM doc WHERE attributes.class = 'stop' ORDER BY node_id;" \
--input docs/fixtures/basic.html
Observed output:
[
{"direct_text":"1 stop"},
{"direct_text":"nonstop"}
]
Before/after diagrams
Before
implicit extraction from unknown scope
After
row R fixed -> supplier S selected -> value extracted
Common mistakes
- Treating extraction failures as row-filter failures.
  Fix: debug row scope and supplier scope separately.
- Ignoring DIRECT_TEXT when nested text pollutes matches.
  Fix: use DIRECT_TEXT for immediate-text conditions.
Chapter takeaway
Reliable extraction comes from explicit supplier logic, not from hoping one selector fits every row variation.