Building code-tree HTML Template and Markdown Scanner — Extending to Document Formats

Overview

code-tree was originally developed as a static analysis tool for extracting symbols from programming languages like Rust, Go, Python, and TypeScript. However, real-world projects contain not only source code but also HTML templates (Hugo Go Template, Jinja2) and Markdown documents as essential components. This article documents the design decisions and implementation process of extending code-tree to support these document formats.

Background: Why Support Document Formats

The Problem

code-tree’s initial design targeted source code files (.rs, .go, .py, etc.). However, in Hugo-based web projects and Python web frameworks using Jinja2 templates, template files and Markdown content constitute a significant portion of the project.

Excluding these from symbol extraction meant:

Template structural changes weren’t reflected in the context window
Markdown document section structures couldn’t be captured
LLM agents lacked sufficient information to understand the full project

Design Goals

Extract scope-defining symbols (define/block/macro) from HTML templates
Classify template variables and control flow as weak-scope symbols
Extract section structures and block elements from Markdown as symbols
Maintain compatibility with the existing SymbolRecord format

HTML Template Symbol Extraction

Regex Design for Hugo/Jinja Templates

HTML templates cannot be fully parsed by tree-sitter’s HTML parser alone. Constructs like {{ define "main" }} or {% block content %} are not recognized as HTML nodes — they are embedded within text content. This necessitated a combination of regex-based pattern matching and tree-sitter AST integration.

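// Assumes Lazy comes from the once_cell crate and Regex from the regex crate
use once_cell::sync::Lazy;
use regex::Regex;

// Hugo template patterns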
static HUGO_BLOCK_RE: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"\{\{-?\s*(define|block)\s+").unwrap()
});
static HUGO_CONTROL_RE: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"\{\{-?\s*(if|else|range|with|end)\b").unwrap()
});
static HUGO_VAR_RE: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"\{\{-?\s*(\.\w+|\$\w+)").unwrap()
});

// Jinja template patterns
static JINJA_BLOCK_RE: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"\{%-?\s*(extends|block|macro)\s+").unwrap()
});

Strong/Weak Scope Classification

Template symbol classification applies the same strong/weak concept as programming language function/struct symbols:

Strong scope (structure-defining symbols):

Hugo: define, block — template inheritance units
Jinja: extends, block, macro — layout structure definitions

Weak scope (referenced/used symbols):

if/range/with — control flow
.Title/$site — variable references
partial — partial template invocations

This classification enables code-tree’s --scope strong option to display only the template skeleton, while --scope weak includes detailed control flow.

Strong Keyword Constants

const HUGO_STRONG_KEYWORDS: &[&str] = &["define", "block"];
const JINJA_STRONG_KEYWORDS: &[&str] = &["extends", "block", "macro"];

Only template blocks matching these keywords are classified as strong symbols. All other control flow and variable references become weak symbols.
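As a rough illustration of the rule above, the following std-only sketch classifies a single template action by its leading keyword. The names Dialect, Scope, leading_keyword, and classify are hypothetical and chosen for this example; code-tree itself matches with the regexes shown earlier, so treat this as a simplified stand-in rather than the actual implementation.

```rust
/// Template dialects discussed in this article (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Dialect {
    Hugo,
    Jinja,
}

/// Scope strength of an extracted symbol (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Scope {
    Strong,
    Weak,
}

const HUGO_STRONG_KEYWORDS: &[&str] = &["define", "block"];
const JINJA_STRONG_KEYWORDS: &[&str] = &["extends", "block", "macro"];

/// Returns the first word of a template action, e.g. `define` for
/// `{{- define "main" }}`. Returns None when the action starts with
/// something other than a word, such as `.Title` or `$site`.
pub fn leading_keyword(tag: &str) -> Option<&str> {
    let inner = tag
        .strip_prefix("{{")
        .or_else(|| tag.strip_prefix("{%"))?;
    // Skip the optional whitespace-trim marker and any leading spaces.
    let inner = inner.strip_prefix('-').unwrap_or(inner).trim_start();
    let end = inner
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(inner.len());
    if end == 0 {
        None
    } else {
        Some(&inner[..end])
    }
}

/// Strong only if the leading keyword is in the dialect's strong list;
/// everything else (control flow, variables, partials) is weak.
pub fn classify(dialect: Dialect, tag: &str) -> Scope {
    let strong = match dialect {
        Dialect::Hugo => HUGO_STRONG_KEYWORDS,
        Dialect::Jinja => JINJA_STRONG_KEYWORDS,
    };
    match leading_keyword(tag) {
        Some(kw) if strong.contains(&kw) => Scope::Strong,
        _ => Scope::Weak,
    }
}
```

With this sketch, classify(Dialect::Hugo, "{{ if .Params.toc }}") yields Scope::Weak, while a define action yields Scope::Strong.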

Markdown Scanner Design

Tree-sitter Query-Based Approach

The initial implementation used manual tree-sitter AST walking, which led to verbose node type checks and required code modifications in multiple locations when adding new block element types.

The refactored approach leverages tree-sitter’s query capability, detecting all block elements with a single query:

let query = Query::new(
    &lang.into(),
    r#"
    (atx_heading) @heading
    (setext_heading) @heading
    (fenced_code_block) @code_fence
    (block_quote) @blockquote
    (list) @list
    (table) @table
    (html_block) @html_block
    (link_reference_definition) @link_ref
    "#,
)?;

Benefits of this approach:

Adding a new block element requires only one line added to the query string
Pattern matching is delegated to the tree-sitter engine, keeping Rust code simple
Capture names directly determine symbol type

Section Generation Logic

A key characteristic of Markdown documents is that headings define section boundaries. code-tree generates section symbols with the following logic:

1. Collect all heading positions and levels
2. Treat content before the first heading as a (preamble) section
3. Define each section's range as running from its heading to the line before the next heading (or to the end of the document)
4. Generate section names in H{level}: {heading_text} format

struct Heading {
    level: usize,
    line_no: usize,
    text: String,
}

// Section range calculation
for (i, heading) in headings.iter().enumerate() {
    let start_line = heading.line_no;
    let end_line = if i + 1 < headings.len() {
        headings[i + 1].line_no - 1
    } else {
        lines.len()
    };
    let name = format!("H{}: {}", heading.level, heading.text);
    // Register as section symbol
}

Section range calculation is O(n) in the number of headings: heading positions are collected first, and each section's end is then read off the next heading's line in a single linear pass. Sections are flat, so a section ends at the next heading of any level, and nesting depth adds no extra cost.
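The steps above, including the (preamble) case that the loop does not show, can be sketched as a self-contained function. The Section struct and build_sections are hypothetical names for this example, and line numbers are assumed to be 1-based with headings already sorted by line.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Heading {
    pub level: usize,
    pub line_no: usize, // 1-based line of the heading
    pub text: String,
}

/// Hypothetical result type; code-tree registers symbols instead.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Section {
    pub name: String,
    pub start_line: usize,
    pub end_line: usize,
}

/// Builds flat sections from headings sorted by `line_no`.
/// `total_lines` is the number of lines in the document.
pub fn build_sections(headings: &[Heading], total_lines: usize) -> Vec<Section> {
    let mut sections = Vec::with_capacity(headings.len() + 1);

    // Content before the first heading (or the whole file when there are
    // no headings) becomes the preamble section.
    let first_heading_line = headings.first().map_or(total_lines + 1, |h| h.line_no);
    if first_heading_line > 1 {
        sections.push(Section {
            name: "(preamble)".to_string(),
            start_line: 1,
            end_line: first_heading_line - 1,
        });
    }

    // Each section runs from its heading to the line before the next
    // heading, or to the end of the document for the last one.
    for (i, heading) in headings.iter().enumerate() {
        let end_line = headings
            .get(i + 1)
            .map_or(total_lines, |next| next.line_no.saturating_sub(1));
        sections.push(Section {
            name: format!("H{}: {}", heading.level, heading.text),
            start_line: heading.line_no,
            end_line,
        });
    }
    sections
}
```

For a 10-line document with an H1 on line 3 and an H2 on line 7, this produces (preamble) for lines 1-2, H1 for lines 3-6, and H2 for lines 7-10.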

Code Fence Title Extraction

Code fence titles are determined with the following priority:

1. Extract the title from a comment on the line immediately following the fence (for example # title or // title)
2. If the info string contains title=..., use that value
3. If neither applies, default to fence:{lang}

  "code_fence" => {
    let info = node
        .child_by_field_name("info_string")
        .map_or("", |n| n.utf8_text(source.as_bytes()).unwrap_or(""));
    // Label the fence with its whole info string, or "plain" when there is none
    let lang_label = if info.is_empty() { "plain" } else { info };
    let title = format!("fence:{}", lang_label);
    md_push_symbol(&mut out, rel_path, scope, "codeblock", title, line_no, &text);
}
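The match arm above shows only the default label. One way the full priority order could be resolved is sketched below on plain strings, independent of tree-sitter. The function fence_title and its parameters are hypothetical, and two simplifications are assumptions of this sketch: a comment marker counts only when followed by whitespace (so #[derive(...)] or #!/bin/sh are not mistaken for titles), and title= values are not expected to contain spaces.

```rust
/// Resolves a code fence title.
/// `info` is the fence info string (e.g. `rust title=example`);
/// `first_body_line` is the first line inside the fence, if any.
pub fn fence_title(info: &str, first_body_line: Option<&str>) -> String {
    // 1. A comment on the line right after the opening fence.
    if let Some(line) = first_body_line {
        let line = line.trim();
        for marker in ["//", "#"] {
            if let Some(rest) = line.strip_prefix(marker) {
                // Require whitespace after the marker to skip attributes,
                // shebangs, and preprocessor directives.
                if rest.starts_with(char::is_whitespace) && !rest.trim().is_empty() {
                    return rest.trim().to_string();
                }
            }
        }
    }

    // 2. A title=... token in the info string.
    for token in info.split_whitespace() {
        if let Some(value) = token.strip_prefix("title=") {
            let value = value.trim_matches('"');
            if !value.is_empty() {
                return value.to_string();
            }
        }
    }

    // 3. Fall back to fence:{lang}, using "plain" when no language is given.
    let lang = info
        .split_whitespace()
        .find(|token| !token.contains('='))
        .unwrap_or("plain");
    format!("fence:{}", lang)
}
```

Keeping the resolution in a pure function like this makes the priority order easy to unit test without building a syntax tree.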

Symbol Kind Design

Markdown symbol kinds were designed for compatibility with the existing kind system:

| kind | Target | name content |
|------|--------|--------------|
| `section` | Heading sections | `H{level}: {heading_text}` |
| `heading` | Heading lines | Heading text |
| `codeblock` | Code fences | `fence:{lang}` + title |
| `table` | GFM tables | Header row |
| `blockquote` | Block quotes | First line |
| `list` | List blocks | First item |
| `html_block` | HTML blocks | First tag |
| `link_ref` | Link reference definitions | `[label]` |
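Because capture names determine the symbol type, the mapping from the query's capture names to the kinds in this table can be a single match. The sketch below assumes the capture names from the query shown earlier; kind_for_capture is a hypothetical helper name. The section kind is absent because sections are derived from headings rather than captured directly.

```rust
/// Maps a tree-sitter capture name to a symbol kind.
/// Returns None for captures that have no corresponding kind.
pub fn kind_for_capture(capture: &str) -> Option<&'static str> {
    match capture {
        "heading" => Some("heading"),
        "code_fence" => Some("codeblock"),
        "blockquote" => Some("blockquote"),
        "list" => Some("list"),
        "table" => Some("table"),
        "html_block" => Some("html_block"),
        "link_ref" => Some("link_ref"),
        _ => None,
    }
}
```

Supporting a new block element then means adding one query line and one match arm.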

Strong Summary Output Format

To maintain consistency with existing Rust/Go output, a one-symbol-per-line format is used:

  12 [section][abcd1234] H2: Markdown AST (lines 12-58) // ## Markdown AST
  20 [codeblock][beef5678] fence:rust title=example (lines 20-31) // ```rust fn main() { ... }
  40 [table][cafe9999] cols: name|type|desc (lines 40-45) // | name | type | desc |
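A line in this format could be produced by a formatter along the following lines. The SummaryLine struct, its field names, and the right-aligned width of 4 for the line number are assumptions made for this sketch; the 8-character identifier is treated as an opaque string because the article does not say how it is computed.

```rust
/// Hypothetical view of one symbol for summary output.
pub struct SummaryLine<'a> {
    pub line_no: usize,
    pub kind: &'a str,
    pub id: &'a str, // opaque 8-character identifier
    pub name: &'a str,
    pub start_line: usize,
    pub end_line: usize,
    pub snippet: &'a str, // first source line of the symbol
}

/// Formats one symbol as a single summary line.
pub fn format_summary(s: &SummaryLine) -> String {
    format!(
        "{:>4} [{}][{}] {} (lines {}-{}) // {}",
        s.line_no, s.kind, s.id, s.name, s.start_line, s.end_line, s.snippet
    )
}
```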

Test Design

Hugo Template Zoo

HTML template testing uses a comprehensive “template zoo” file covering Hugo’s major features:

{{ define "main" }}
  {{ $p := . }}
  {{ $site := .Site }}
  {{ $title := cond (ne .Title "") .Title $site.Title }}
  <!-- variables, pipelines, conditionals, loops, partials... -->
{{ end }}

{{ define "inline-snippet" }}
  {{ $p := .p }}
  <div>{{ printf "Inline: %s" $p.Title }}</div>
{{ end }}

This test file verifies:

define blocks are detected as strong symbols
Variable references ($title, .Site) are classified as weak symbols
Control flow detection (if/range/with)
Pipeline syntax handling (| upper | printf)
Recognition of partial/partialCached/template invocations

Integration Test Framework

Adding new language support requires verification through two test functions:

#[test]
fn detect_lang_covers_main_extensions() {
    assert_eq!(
        code_tree::detect_lang_from_rel_path("a.html"),
        Some(Lang::Html)
    );
}

#[test]
fn scan_file_covers_languages() {
    assert_scan_non_empty(
        Lang::Html,
        "testdata/hugo.html",
        r#"{{ define "main" }} ... {{ end }}"#,
    );
}

The assert_scan_non_empty helper creates a source file in a temporary directory and confirms the scan result is non-empty. Together, the two tests cover language detection and basic symbol extraction, though a non-empty result does not by itself verify that the right symbols were extracted.

Implementation Challenges

Challenge 1: GFM Dialect Differences

tree-sitter-markdown may use different node names between GFM (GitHub Flavored Markdown) and standard CommonMark. In particular, the table node depends on GFM extensions and is not recognized in environments without them.

Solution: Enable GFM extensions in the tree-sitter-md crate and implement heuristic | separator detection as a fallback.
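The fallback heuristic is not shown in the article; one plausible shape for it, following the GFM rule that a table starts with a header row followed by a delimiter row with the same number of cells, is sketched below. The function names are hypothetical, and the sketch deliberately ignores escaped pipes (\|) and pipes inside code spans.

```rust
/// Splits a pipe-table row into trimmed cells, ignoring one leading
/// and one trailing `|`.
fn split_row(line: &str) -> Vec<&str> {
    let t = line.trim();
    let t = t.strip_prefix('|').unwrap_or(t);
    let t = t.strip_suffix('|').unwrap_or(t);
    t.split('|').map(str::trim).collect()
}

/// A delimiter cell is one or more `-`, optionally wrapped in `:`
/// for alignment (e.g. `---`, `:---`, `:---:`).
fn is_delimiter_cell(cell: &str) -> bool {
    let core = cell.strip_prefix(':').unwrap_or(cell);
    let core = core.strip_suffix(':').unwrap_or(core);
    !core.is_empty() && core.chars().all(|c| c == '-')
}

/// Returns the header cells when `header` followed by `delimiter`
/// looks like the start of a GFM table, otherwise None.
pub fn detect_table_header<'a>(header: &'a str, delimiter: &str) -> Option<Vec<&'a str>> {
    if !header.contains('|') || !delimiter.contains('-') {
        return None;
    }
    let cells = split_row(header);
    let delims = split_row(delimiter);
    if cells.len() == delims.len() && delims.iter().all(|c| is_delimiter_cell(c)) {
        Some(cells)
    } else {
        None
    }
}
```

The returned header cells can feed the table symbol's name directly, which matches the "Header row" entry in the kind table.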

Challenge 2: Automatic Template Dialect Detection

The same .html file may use Hugo templates ({{ }} syntax) or Jinja2 templates ({% %} syntax), requiring different regex patterns.

Solution: Scan file content — if {{ define/{{ block patterns are found, classify as Hugo; if {% extends/{% block patterns are found, classify as Jinja. When both are present, apply both pattern sets.
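A minimal std-only sketch of this content sniffing follows. The Dialects struct and both function names are hypothetical, and plain string scanning is used here in place of the regexes shown earlier so the example has no dependencies. The keyword lists mirror the patterns named in the solution; macro could be added to the Jinja list in the same way.

```rust
/// Which template dialects were detected in a file (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Dialects {
    pub hugo: bool,
    pub jinja: bool,
}

/// True if `src` contains `open`, then an optional `-`, optional
/// whitespace, and one of `keywords` as a whole word.
fn has_action(src: &str, open: &str, keywords: &[&str]) -> bool {
    src.match_indices(open).any(|(pos, _)| {
        let rest = &src[pos + open.len()..];
        let rest = rest.strip_prefix('-').unwrap_or(rest).trim_start();
        keywords.iter().any(|kw| {
            rest.strip_prefix(*kw).map_or(false, |tail| {
                // Reject longer identifiers such as `blocker`.
                !tail.starts_with(|c: char| c.is_alphanumeric() || c == '_')
            })
        })
    })
}

/// Both flags may be true; callers then apply both pattern sets.
pub fn detect_dialects(src: &str) -> Dialects {
    Dialects {
        hugo: has_action(src, "{{", &["define", "block"]),
        jinja: has_action(src, "{%", &["extends", "block"]),
    }
}
```

A file with neither marker yields Dialects::default(), which a caller could treat as plain HTML.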

Challenge 3: Setext Heading Level Detection

Setext headings use underlines of === (H1) and --- (H2) to specify heading level, requiring different detection logic from atx headings (counting # characters).

Solution:

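let level = if node.kind() == "atx_heading" {
    // atx: the level is the number of # characters in the marker
    node.child(0).unwrap().utf8_text(src.as_bytes()).unwrap().len()
} else {
    // setext: inspect only the underline (the last line), because the
    // heading text itself may contain '=' or '-'
    let underline = text.trim_end().lines().last().unwrap_or("");
    if underline.trim_start().starts_with('=') { 1 } else { 2 }
};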
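For unit testing without a parser, the same decision can be expressed on the raw heading text. The sketch below uses a hypothetical heading_level function and takes the level of a setext heading from its underline only, since a title such as "a = b" underlined with --- is an H2 despite containing an equals sign. It returns None for inputs that are not valid headings instead of guessing.

```rust
/// Determines a heading level from the heading's source text.
/// For setext headings, `heading_text` must include the underline.
pub fn heading_level(is_atx: bool, heading_text: &str) -> Option<usize> {
    if is_atx {
        // atx: count the leading # characters (valid levels are 1 to 6).
        let hashes = heading_text
            .trim_start()
            .chars()
            .take_while(|&c| c == '#')
            .count();
        if (1..=6).contains(&hashes) {
            Some(hashes)
        } else {
            None
        }
    } else {
        // setext: the last non-empty line is the underline.
        let underline = heading_text
            .lines()
            .rev()
            .map(str::trim)
            .find(|line| !line.is_empty())?;
        match underline.chars().next()? {
            '=' if underline.chars().all(|c| c == '=') => Some(1),
            '-' if underline.chars().all(|c| c == '-') => Some(2),
            _ => None,
        }
    }
}
```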

Key Learnings

1. Combining Regex with tree-sitter

For template languages where a guest language (template syntax) is embedded within a host language (HTML), tree-sitter alone is insufficient. A two-stage approach — parsing HTML structure with tree-sitter and detecting template syntax within text nodes with regex — proved practical.

2. Query-Based Symbol Extraction

Using tree-sitter’s query feature centralizes node type detection into query strings. Compared to manual AST walking, the code becomes more concise and adding new block elements is straightforward.

3. Generality of the Section Concept

The Markdown section generation logic is applicable to other document formats (reStructuredText, AsciiDoc). The pattern of collecting heading positions and then determining section boundaries with a linear scan is broadly reusable.

Results Achieved

HTML Template Support: Symbol extraction from both Hugo Go Template and Jinja2
Markdown Scanner: Comprehensive block element extraction including section generation
System Compatibility: Maintained SymbolRecord format, integrated with strong/weak summaries
Extensibility: Adding new template dialects requires only pattern definition additions
