Skip to the content.

Chapter 5: Row Filtering with WHERE

TL;DR

WHERE is the row gate. If row filtering is wrong, no extraction function can rescue the result.

Using self (the current row node)

self refers to the current node for the current row produced by FROM.”

Use self when you want to inspect the current row node directly without guessing a supplier tag.

SELECT self.node_id, self.tag
FROM doc
LIMIT 5;
SELECT self.node_id, self.tag, DIRECT_TEXT(self) AS dt
FROM doc
WHERE DIRECT_TEXT(self) LIKE '%some_substring%';
SELECT self.node_id, self.tag
FROM doc
WHERE EXISTS(descendant WHERE DIRECT_TEXT(self) LIKE '%some_substring%');

What is WHERE in MarkQL?

WHERE is the stage-1 row filter. It runs on each candidate row from FROM and decides whether that row survives into the output. In MarkQL, this is the single place where row existence is decided.

This matters because many extraction mistakes are actually row-filter mistakes. If WHERE is too broad, you get noisy rows and confusing null fields. If WHERE is too narrow, you get empty output and assume extraction is broken. MarkQL’s model asks you to solve row scope first, then values.

This may feel unfamiliar if you expect field functions to implicitly control row inclusion. They do not. They can return null, but null does not remove a row. If row inclusion must depend on a structural fact, put that fact in outer WHERE (often via EXISTS).

Note: WHERE runs over rows that already exist in the stream

MarkQL does not run WHERE over arbitrary text fragments. It runs it over node rows with identity. That means tag, attributes, axis expressions, and operators are all evaluated in row context. This is why WHERE is explainable and debuggable with LIMIT and small projections.

Rules

Scope

FOR each row R in doc:
  evaluate WHERE(R)
  if true -> keep R
  else    -> drop R
R context has:
  R.tag
  R.attributes.*
  axis-relative nodes from R

Listing 5-1: Stable row gate with attributes

./build/markql --mode plain --color=disabled \
  --query "SELECT section.node_id, section.tag FROM doc WHERE attributes.data-kind IN ('flight') ORDER BY node_id;" \
  --input docs/fixtures/basic.html

Observed output:

[
  {"node_id":6,"tag":"section"},
  {"node_id":11,"tag":"section"}
]

The filter is intentionally explicit and readable. If data-kind changes, you know exactly why rows disappeared.

Listing 5-2: LIKE with normalization

./build/markql --mode plain --color=disabled \
  --query "SELECT section.node_id, section.tag FROM doc WHERE attributes.data-kind IS NOT NULL AND LOWER(attributes.data-kind) LIKE '%flight%' ORDER BY node_id;" \
  --input docs/fixtures/basic.html

Observed output:

[
  {"node_id":6,"tag":"section"},
  {"node_id":11,"tag":"section"}
]

This listing demonstrates a common pattern: normalize first (LOWER), match second (LIKE). It is simple but robust.

Listing 5-3: Deliberate failure (CONTAINS misuse)

Naive query:

SELECT section.node_id FROM doc WHERE text CONTAINS 'stop';
# EXPECT_FAIL: CONTAINS supports only attributes
./build/markql --mode plain --color=disabled \
  --query "SELECT section.node_id FROM doc WHERE text CONTAINS 'stop';" \
  --input docs/fixtures/basic.html

Observed error:

Error: CONTAINS supports only attributes

Why this fails: in the current implementation, CONTAINS is attribute-oriented. For textual wildcard matching in row filters, use LIKE over an expression that returns string data.

Fix example:

./build/markql --mode plain --color=disabled \
  --query "SELECT section.node_id FROM doc WHERE attributes.data-kind IS NOT NULL AND LOWER(attributes.data-kind) LIKE '%flight%' ORDER BY node_id;" \
  --input docs/fixtures/basic.html

Before/after diagrams

Before
  text CONTAINS 'x'  (assumed generic contains)
After
  LOWER(attr) LIKE '%x%'  (supported and explicit)

Common mistakes

Chapter takeaway

Treat outer WHERE as your primary correctness boundary for every query.