Chapter 8: FLATTEN
TL;DR
FLATTEN is the fastest way to explore hierarchical text as columns, but it is not a long-term schema guarantee.
What is FLATTEN?
FLATTEN(tag[, depth]) is an extraction shortcut that maps ordered descendant text slices into output columns. It is designed for quick structure discovery and early prototyping when strict per-column supplier logic is not yet needed.
It matters because early exploration often needs speed over precision. You may not yet know the best supplier predicate per field. FLATTEN lets you quickly inspect text ordering and draft a schema. Then, once drift is visible, you can migrate to PROJECT with explicit fields.
This can feel unfamiliar because FLATTEN succeeds even when data shape varies per row, which can produce column drift. That behavior is not a bug; it is the tradeoff. FLATTEN optimizes discovery. PROJECT optimizes stability.
Note:
FLATTENis a prototyping tool, not a final contractIf your source rows can omit fields or reorder chunks, flattened columns can shift. Use flatten output to learn field positions quickly, then codify business columns in
PROJECTwith supplier predicates and fallback logic.
Rules
- Use
FLATTENfirst when you are mapping unknown DOM structure. - Expect drift when rows have optional nodes.
- Keep alias list explicit with
AS (...)for readability. - Move to
PROJECTfor production schema. - Keep row set constrained before flattening.
Scope
kept row R
gather ordered text slices from subtree(R)
assign slices left-to-right to aliases
if row has fewer slices
trailing aliases become NULL
if row has different slice meaning
semantic drift appears
Listing 8-1: Fast flatten mapping
Fixture: docs/fixtures/products.html
./build/markql --mode plain --color=disabled \
--query "SELECT li.node_id, FLATTEN(li) AS (name, note, lang1, lang2) FROM doc WHERE tag = 'li' ORDER BY node_id;" \
--input docs/fixtures/products.html
Observed output:
[
{"node_id":3,"name":"Alpha","note":"Fast and light","lang1":"EN","lang2":"JP"},
{"node_id":8,"name":"Beta","note":"EN","lang1":null,"lang2":null},
{"node_id":11,"name":"Gamma","note":"Budget","lang1":"EN","lang2":"TH"}
]
The second row demonstrates drift: the note column became language text because p was absent.
Listing 8-2: Deliberate failure boundary with aggregate mixing
# EXPECT_FAIL: Aggregate queries require a single select item
./build/markql --mode plain --color=disabled \
--query "SELECT COUNT(*), FLATTEN(li) AS (a) FROM doc WHERE tag='li';" \
--input docs/fixtures/products.html
Observed error:
Error: Aggregate queries require a single select item
This is useful because it shows a parser/validator boundary: aggregate shape rules are enforced before execution.
Listing 8-3: Stable rewrite with PROJECT
./build/markql --mode plain --color=disabled \
--query "SELECT li.node_id, PROJECT(li) AS (name: TEXT(h2), note: COALESCE(TEXT(p), 'n/a'), lang_primary: FIRST_TEXT(span), lang_secondary: TEXT(span, 2)) FROM doc WHERE tag='li' ORDER BY node_id;" \
--input docs/fixtures/products.html
Observed output:
[
{"node_id":3,"name":"Alpha","note":"Fast and light","lang_primary":"EN","lang_secondary":"JP"},
{"node_id":8,"name":"Beta","note":"n/a","lang_primary":"EN","lang_secondary":null},
{"node_id":11,"name":"Gamma","note":"Budget","lang_primary":"EN","lang_secondary":"TH"}
]
Before/after diagrams
Before
FLATTEN -> quick columns, possible drift
After
PROJECT -> explicit suppliers, stable semantics
Common mistakes
- Treating flatten column positions as permanent schema contracts.
Fix: migrate toPROJECTonce exploration is done. - Flattening overly broad row sets.
Fix: constrain rows before flattening to reduce noise.
Chapter takeaway
Use FLATTEN to discover structure quickly, then encode production schema with PROJECT.