
MarkQL in One Chapter

From brittle scraping scripts to readable, reproducible extraction.

TL;DR

MarkQL lets you query HTML as rows instead of manually walking DOM trees in code. You write what data you want, and MarkQL handles traversal and extraction rules in one place. Start by finding stable rows, then compute columns from those rows.

If this feels different from your usual scraping workflow, that is expected. Most people first learn scraping as a sequence of imperative steps. MarkQL asks you to express intent first: row criteria, then field criteria. After a few runs, that shift usually makes query behavior easier to predict.

The pain: why scraping breaks

If you have done scraping in Python, JavaScript, or any browser automation stack, you already know the failure pattern: the script works today, the page markup shifts slightly tomorrow, a selector stops matching, and you spend an afternoon stepping through traversal code to find out why.

The painful part is not just “DOM changed.” The painful part is where your logic lives. A lot of scraping code mixes three jobs together:

  1. finding candidate blocks,
  2. traversing to nested nodes,
  3. shaping output schema.

When those three jobs are spread across imperative glue code, every small page change can trigger a large debugging session. You are not just fixing one selector; you are re-checking loops, index assumptions, optional branches, and post-processing rules.

That is why scraping feels fragile: your extraction intent is buried inside orchestration code.

A practical way to say this: when extraction logic is spread across traversal code, every selector change forces you to re-read control flow. When extraction logic is centralized, selector changes usually affect one query clause rather than a full pipeline.

The paradigm shift: “How” vs “What”

Traditional scraping often looks like this: select candidate nodes, loop over them, walk into children, and assemble output dictionaries by hand. Every step spells out how to move through the DOM.

MarkQL shifts the center of gravity: you declare what rows you want and what fields each row should carry, and the engine owns the traversal.

In plain words: stop writing loops, start writing criteria.

This can feel unfamiliar in the first few queries, especially if you are used to writing loops first and extraction second. That is normal. Once you separate row selection from field computation, the query becomes easier to read, review, and fix.

That one shift reduces cognitive load. You can review and debug extraction as a query contract instead of reverse-engineering control flow.

The core tradeoff is simple and useful: you give up some ad-hoc flexibility in exchange for clearer semantics. In scraping work, that trade is often worth it because maintainability and reproducibility matter more than quick one-off hacks.

Side-by-side comparison

Traditional extraction code

# Assumes BeautifulSoup (bs4) and that `html` holds the page source.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

cards = soup.select("section")
rows = []
for card in cards:
    # Job 1: keep only candidate blocks (row filtering by hand).
    if card.get("data-kind") != "flight":
        continue

    h3 = card.select_one("h3")
    spans = card.select("span")

    # Job 2: traverse nested nodes and guess each span's role.
    stop_text = None
    price = None
    for sp in spans:
        t = sp.get_text(" ", strip=True)
        if "stop" in t.lower():
            stop_text = t
        if sp.get("role") == "text":
            price = t

    # Job 3: shape the output schema manually.
    rows.append({
        "title": h3.get_text(strip=True) if h3 else None,
        "stops": stop_text,
        "price": price,
    })

MarkQL query

SELECT section.node_id,
PROJECT(section) AS (
  title: TEXT(h3),
  stops: TEXT(span WHERE DIRECT_TEXT(span) LIKE '%stop%'),
  price: TEXT(span WHERE attributes.role = 'text')
)
FROM doc
WHERE attributes.data-kind = 'flight'
ORDER BY node_id;

The important difference is not fewer lines. The important difference is clarity: row criteria live in one clause (the outer WHERE), and field criteria live in another (the PROJECT block).

That clarity becomes concrete when you debug. If output rows are wrong, inspect outer WHERE. If columns are wrong, inspect field expressions in PROJECT. The query itself separates those concerns, so the debugging path is shorter and less ambiguous.
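
One way to make that debugging path concrete, as a minimal sketch against the flight example above: check the row scope on its own, before any fields.

SELECT section.node_id
FROM doc
WHERE attributes.data-kind = 'flight';

If this returns the wrong rows, fix the outer WHERE. If the rows are right but a column is wrong, fix the matching field expression inside PROJECT.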

The killer feature: stable schema extraction with PROJECT

The fastest “aha” in MarkQL is PROJECT(...).

PROJECT says: for each row that survives the outer WHERE, compute exactly these named fields, one expression per column.

That sounds small, but it solves a big problem: schema drift. In exploratory extraction, FLATTEN is great and fast. But when rows differ (optional nodes, missing blocks), flattened columns can shift meaning. PROJECT lets you pin each column to a field expression, so your schema stays understandable.

You can still use FLATTEN for discovery. But for extraction you want to maintain, PROJECT is usually the bridge from “it works on my sample” to “it keeps working next month.”

In other words, PROJECT makes the extraction contract explicit. Each field is named, each value source is documented by expression, and optional data can be handled with deliberate defaults. That explicitness is what makes handoff and review easier across a team.
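
As an illustration of that contract, here is a sketch that reuses the flight query and adds a hypothetical optional field (no span with role 'cabin' exists in the fixture; it stands in for any optional node):

SELECT section.node_id,
PROJECT(section) AS (
  title: TEXT(h3),
  cabin: TEXT(span WHERE attributes.role = 'cabin')
)
FROM doc
WHERE attributes.data-kind = 'flight';

Each column is named, each value source reads directly from its expression, and a section without a matching span simply yields NULL in cabin instead of shifting the other columns.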

Beginner mental model

Think of a page as a table of nodes.

  1. FROM doc gives you rows (nodes).
  2. WHERE keeps rows you care about.
  3. PROJECT (or fields) computes columns for each kept row.

HTML page
   |
   v
[table of nodes]
   |
   |  WHERE (keep rows)
   v
[rows you care about]
   |
   |  PROJECT/TEXT/ATTR (compute columns)
   v
[result table]
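
To map those three stages onto one concrete query, here is a minimal sketch using the clauses from earlier in this chapter:

SELECT section.node_id,
PROJECT(section) AS (title: TEXT(h3))
FROM doc
WHERE attributes.data-kind = 'flight';

FROM doc supplies the table of nodes, WHERE keeps only the flight sections, and PROJECT computes a title column for each surviving row.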

Later chapters layer richer syntax on top of this model, but the shape stays the same: that two-step split is the core of reliable MarkQL queries.

Note: one row, then many field decisions

A useful picture is: “one row enters, several field expressions run.”
The row is either kept or dropped once. Field expressions then compute values independently for that kept row.
This is why one field can be NULL while other fields in the same row are valid.
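
A sketch of that behavior, reusing the flight query: if some section carried an h3 but no span containing 'stop' (hypothetical here; both fixture sections have one), its row would still appear, with title filled in and stops set to NULL.

SELECT section.node_id,
PROJECT(section) AS (
  title: TEXT(h3),
  stops: TEXT(span WHERE DIRECT_TEXT(span) LIKE '%stop%')
)
FROM doc
WHERE attributes.data-kind = 'flight';

The row decision happened once, in WHERE; the NULL came later, from one field expression.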

Try it in 60 seconds

Use fixture: docs/fixtures/basic.html

a) Show me the rows

Query idea: inspect the first 5 nodes so you can see actual structure.

./build/markql --mode plain --color=disabled \
  --query "SELECT * FROM doc LIMIT 5;" \
  --input docs/fixtures/basic.html

Observed output:

[
  {
    "attributes": {},
    "doc_order": 0,
    "max_depth": 5,
    "node_id": 0,
    "parent_id": null,
    "tag": "html"
  },
  {
    "attributes": {},
    "doc_order": 1,
    "max_depth": 4,
    "node_id": 1,
    "parent_id": 0,
    "tag": "body"
  },
  {
    "attributes": {"id": "content"},
    "doc_order": 2,
    "max_depth": 3,
    "node_id": 2,
    "parent_id": 1,
    "tag": "main"
  },
  {
    "attributes": {"id": "nav"},
    "doc_order": 3,
    "max_depth": 1,
    "node_id": 3,
    "parent_id": 2,
    "tag": "nav"
  },
  {
    "attributes": {"href": "/home", "rel": "nav"},
    "doc_order": 4,
    "max_depth": 0,
    "node_id": 4,
    "parent_id": 3,
    "tag": "a"
  }
]

This command is your structural sanity check. Before choosing any extraction strategy, confirm what the parser actually produced: tags, ids, parent relations, and document order. Most downstream mistakes get cheaper to fix if this step is done first.
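
A natural follow-up, using only clauses already shown: narrow the listing to one tag so you can count candidate rows before writing any extraction.

./build/markql --mode plain --color=disabled \
  --query "SELECT * FROM doc WHERE tag = 'section';" \
  --input docs/fixtures/basic.html

If the row count does not match the number of blocks you expect on the page, resolve that mismatch before reaching for PROJECT.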

b) Extract something real

Query idea: keep only flight sections and compute title, stops, and price.

./build/markql --mode plain --color=disabled \
  --query "SELECT section.node_id, PROJECT(section) AS (title: TEXT(h3), stops: TEXT(span WHERE DIRECT_TEXT(span) LIKE '%stop%'), price: TEXT(span WHERE attributes.role = 'text')) FROM doc WHERE attributes.data-kind = 'flight' ORDER BY node_id;" \
  --input docs/fixtures/basic.html

Observed output:

[
  {
    "node_id": 6,
    "price": "¥12,300",
    "stops": "1 stop",
    "title": "Tokyo"
  },
  {
    "node_id": 11,
    "price": "¥8,500",
    "stops": "nonstop",
    "title": "Osaka"
  }
]

This output shows a complete extraction contract in a small form: the outer WHERE pins which sections become rows, and each named column documents exactly where its value comes from.

As chapters progress, syntax becomes richer, but this flow does not change.
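
For instance, pointing the same flow at a different row type changes only the criteria, not the shape (hypothetical; the fixture ships flight sections, so substitute whatever data-kind your pages use):

SELECT section.node_id,
PROJECT(section) AS (title: TEXT(h3))
FROM doc
WHERE attributes.data-kind = 'hotel'
ORDER BY node_id;

Only the row criterion differs; the WHERE-then-PROJECT structure is untouched.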

Common beginner mistakes

Mistake 1: jumping straight to extraction without stable rows

Bad query:

SELECT TEXT(section) FROM doc;

What goes wrong: every section on the page becomes a row, so unrelated blocks land in your output next to the records you actually want.

Fix:

SELECT TEXT(section)
FROM doc
WHERE attributes.data-kind = 'flight';

Start with explicit row scope first.

Mistake 2: mixing row filtering with field selection

Bad query pattern:

SELECT section.node_id,
PROJECT(section) AS (
  stop_text: TEXT(span WHERE DIRECT_TEXT(span) LIKE '%stop%')
)
FROM doc
WHERE tag = 'section';

Then expecting only stop rows.

What goes wrong: the outer WHERE keeps every section, so sections without stop text still appear as rows; their stop_text is simply NULL.

Fix (if row must have stop text):

SELECT section.node_id,
PROJECT(section) AS (
  stop_text: TEXT(span WHERE DIRECT_TEXT(span) LIKE '%stop%')
)
FROM doc
WHERE tag = 'section'
  AND EXISTS(descendant WHERE tag = 'span' AND text LIKE '%stop%');

Outer WHERE controls row existence. Field expressions control value sourcing.

Why this matters operationally: if your pipeline expects “only rows with stop text,” stage-1 filtering must encode that requirement. If you leave it in field selection only, you get silent null rows rather than explicit exclusion.

Mistake 3: trying to build schema with ad-hoc flattening too early

Bad pattern: running FLATTEN during discovery and then shipping its positional columns as your long-term schema.

What goes wrong: as soon as rows differ (an optional node here, a missing block there), the flattened columns shift meaning, and downstream code silently reads the wrong values.

Fix: keep FLATTEN for exploration, then pin each column to a named field expression with PROJECT before the extractor needs to be maintained.

Closing

Good extraction is not about writing clever selectors. It is about making intent obvious and repeatable: which rows, which fields, which fallbacks.

MarkQL gives you a path from “I can scrape this page today” to “we can maintain this extractor next quarter.”

Sticky line: first lock the rows, then compute the columns. That one habit will save you the most debugging time in MarkQL.