Overview

code-tree was originally developed as a static analysis tool for extracting symbols from programming languages such as Rust, Go, Python, and TypeScript. Real-world projects, however, consist not only of source code: HTML templates (Hugo Go Template, Jinja2) and Markdown documents are equally essential components. This article documents the design decisions and implementation process behind extending code-tree to support these document formats.

Background: Why Support Document Formats

The Problem

code-tree’s initial design targeted source code files (.rs, .go, .py, etc.). However, in Hugo-based web projects and Python web frameworks using Jinja2 templates, template files and Markdown content constitute a significant portion of the project.

Excluding these from symbol extraction meant:

  • Template structural changes weren’t reflected in the context window
  • Markdown document section structures couldn’t be captured
  • LLM agents lacked sufficient information to understand the full project

Design Goals

  1. Extract scope-defining symbols (define/block/macro) from HTML templates
  2. Classify template variables and control flow as weak-scope symbols
  3. Extract section structures and block elements from Markdown as symbols
  4. Maintain compatibility with the existing SymbolRecord format

HTML Template Symbol Extraction

Regex Design for Hugo/Jinja Templates

HTML templates cannot be fully parsed by tree-sitter’s HTML parser alone. Constructs like {{ define "main" }} or {% block content %} are not recognized as HTML nodes — they are embedded within text content. This necessitated a combination of regex-based pattern matching and tree-sitter AST integration.

  // Hugo template patterns
  static HUGO_BLOCK_RE: Lazy<Regex> = Lazy::new(|| {
      Regex::new(r"\{\{-?\s*(define|block)\s+").unwrap()
  });
  static HUGO_CONTROL_RE: Lazy<Regex> = Lazy::new(|| {
      Regex::new(r"\{\{-?\s*(if|else|range|with|end)\b").unwrap()
  });
  static HUGO_VAR_RE: Lazy<Regex> = Lazy::new(|| {
      Regex::new(r"\{\{-?\s*(\.\w+|\$\w+)").unwrap()
  });

  // Jinja template patterns
  static JINJA_BLOCK_RE: Lazy<Regex> = Lazy::new(|| {
      Regex::new(r"\{%-?\s*(extends|block|macro)\s+").unwrap()
  });

Strong/Weak Scope Classification

Template symbols are classified with the same strong/weak distinction used for programming-language symbols such as functions and structs:

Strong scope (structure-defining symbols):

  • Hugo: define, block — template inheritance units
  • Jinja: extends, block, macro — layout structure definitions

Weak scope (referenced/used symbols):

  • if/range/with — control flow
  • .Title/$site — variable references
  • partial — partial template invocations

This classification enables code-tree’s --scope strong option to display only the template skeleton, while --scope weak includes detailed control flow.

Strong Keyword Constants

  const HUGO_STRONG_KEYWORDS: &[&str] = &["define", "block"];
  const JINJA_STRONG_KEYWORDS: &[&str] = &["extends", "block", "macro"];

Only template blocks matching these keywords are classified as strong symbols. All other control flow and variable references become weak symbols.
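As a sketch, the strong/weak decision reduces to a keyword lookup against these lists. The `Dialect` and `Scope` types and the `classify_keyword` function below are illustrative names, not code-tree's actual API:

```rust
const HUGO_STRONG_KEYWORDS: &[&str] = &["define", "block"];
const JINJA_STRONG_KEYWORDS: &[&str] = &["extends", "block", "macro"];

#[derive(Debug, PartialEq)]
pub enum Scope {
    Strong,
    Weak,
}

#[derive(Clone, Copy)]
pub enum Dialect {
    Hugo,
    Jinja,
}

/// Classify a template keyword: structure-defining keywords are strong,
/// everything else (control flow, variable references) is weak.
pub fn classify_keyword(dialect: Dialect, keyword: &str) -> Scope {
    let strong = match dialect {
        Dialect::Hugo => HUGO_STRONG_KEYWORDS,
        Dialect::Jinja => JINJA_STRONG_KEYWORDS,
    };
    if strong.contains(&keyword) {
        Scope::Strong
    } else {
        Scope::Weak
    }
}
```

For example, `classify_keyword(Dialect::Hugo, "range")` yields `Scope::Weak`, while `classify_keyword(Dialect::Jinja, "macro")` yields `Scope::Strong`.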

Markdown Scanner Design

Tree-sitter Query-Based Approach

The initial implementation used manual tree-sitter AST walking, which led to verbose node type checks and required code modifications in multiple locations when adding new block element types.

The refactored approach leverages tree-sitter’s query capability, detecting all block elements with a single query:

  let query = Query::new(
      &lang.into(),
      r#"
      (atx_heading) @heading
      (setext_heading) @heading
      (fenced_code_block) @code_fence
      (block_quote) @blockquote
      (list) @list
      (table) @table
      (html_block) @html_block
      (link_reference_definition) @link_ref
      "#,
  )?;

Benefits of this approach:

  • Adding a new block element requires only one line added to the query string
  • Pattern matching is delegated to the tree-sitter engine, keeping Rust code simple
  • Capture names directly determine symbol type
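Because the capture names carry the symbol type, the mapping from capture to symbol kind can be a single `match`. The function name here is illustrative:

```rust
/// Map a tree-sitter query capture name (from the query above) to a
/// code-tree symbol kind string.
pub fn md_capture_kind(capture: &str) -> &'static str {
    match capture {
        "heading" => "heading",
        "code_fence" => "codeblock",
        "blockquote" => "blockquote",
        "list" => "list",
        "table" => "table",
        "html_block" => "html_block",
        "link_ref" => "link_ref",
        _ => "unknown",
    }
}
```

Adding a new block element then touches exactly two places: one query line and one match arm.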

Section Generation Logic

A key characteristic of Markdown documents is that headings define section boundaries. code-tree generates section symbols with the following logic:

  1. Collect all heading positions and levels
  2. Treat content before the first heading as a (preamble) section
  3. Define section range from each heading to the next heading (or end of document)
  4. Generate section names in H{level}: {heading_text} format

  struct Heading {
      level: usize,
      line_no: usize,
      text: String,
  }

  // Section range calculation
  for (i, heading) in headings.iter().enumerate() {
      let start_line = heading.line_no;
      let end_line = if i + 1 < headings.len() {
          headings[i + 1].line_no - 1
      } else {
          lines.len()
      };
      let name = format!("H{}: {}", heading.level, heading.text);
      // Register as section symbol
  }

Section range calculation completes in O(n). By first aggregating all heading positions and then determining section boundaries with a linear scan, there is no performance impact even with nested heading structures.
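A std-only sketch of the full logic, including the (preamble) case from step 2, might look like the following. The `Section` struct and `build_sections` name are illustrative:

```rust
pub struct Heading {
    pub level: usize,
    pub line_no: usize,
    pub text: String,
}

#[derive(Debug, PartialEq)]
pub struct Section {
    pub name: String,
    pub start_line: usize,
    pub end_line: usize,
}

/// Build section symbols from collected headings (1-indexed lines).
pub fn build_sections(headings: &[Heading], total_lines: usize) -> Vec<Section> {
    let mut out = Vec::new();
    // Step 2: content before the first heading becomes a (preamble) section.
    let first = headings.first().map_or(total_lines + 1, |h| h.line_no);
    if first > 1 {
        out.push(Section {
            name: "(preamble)".into(),
            start_line: 1,
            end_line: first - 1,
        });
    }
    // Steps 3-4: each heading spans until the next heading (or end of file).
    for (i, h) in headings.iter().enumerate() {
        let end_line = headings
            .get(i + 1)
            .map_or(total_lines, |next| next.line_no - 1);
        out.push(Section {
            name: format!("H{}: {}", h.level, h.text),
            start_line: h.line_no,
            end_line,
        });
    }
    out
}
```

A document of 10 lines with an H1 at line 3 and an H2 at line 7 yields three sections: `(preamble)` (1-2), `H1: ...` (3-6), and `H2: ...` (7-10).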

Code Fence Title Extraction

Code fence titles are determined with the following priority:

  1. Extract title from a comment on the line immediately following the fence (# title, // title, <!-- title -->)
  2. If the info string contains title=..., use that value
  3. If neither applies, default to fence:{lang}
  "code_fence" => {
    let info = node
        .child_by_field_name("info_string")
        .map_or("", |n| n.utf8_text(source.as_bytes()).unwrap_or(""));
    let lang_label = if info.is_empty() { "plain" } else { info };
    let title = format!("fence:{}", lang_label);
    md_push_symbol(&mut out, rel_path, scope, "codeblock", title, line_no, &text);
}
  

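The snippet above shows only the default path. A std-only sketch of the full three-step priority chain, with illustrative names (the comment-prefix handling is deliberately simplified), could look like this:

```rust
/// Resolve a code fence title using the three-step priority:
/// 1. a comment on the first line inside the fence,
/// 2. `title=...` in the info string,
/// 3. fallback `fence:{lang}`.
pub fn fence_title(info: &str, first_body_line: &str) -> String {
    // 1. Comment on the line immediately following the fence opener.
    let trimmed = first_body_line.trim();
    for prefix in ["#", "//", "<!--"] {
        if let Some(rest) = trimmed.strip_prefix(prefix) {
            let t = rest.trim_end_matches("-->").trim();
            if !t.is_empty() {
                return t.to_string();
            }
        }
    }
    // 2. `title=...` inside the info string.
    if let Some(pos) = info.find("title=") {
        let value = info[pos + "title=".len()..]
            .split_whitespace()
            .next()
            .unwrap_or("");
        if !value.is_empty() {
            return value.to_string();
        }
    }
    // 3. Default: fence:{lang}, with "plain" for an empty info string.
    let lang = info.split_whitespace().next().unwrap_or("");
    let lang = if lang.is_empty() { "plain" } else { lang };
    format!("fence:{}", lang)
}
```

So `fence_title("rust title=example", "fn main() {}")` resolves to `example`, while an untitled, language-less fence falls back to `fence:plain`.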
Symbol Kind Design

Markdown symbol kinds were designed for compatibility with the existing kind system:

  kind         Target                        name content
  section      Heading sections              H{level}: {heading_text}
  heading      Heading lines                 Heading text
  codeblock    Code fences                   fence:{lang} + title
  table        GFM tables                    Header row
  blockquote   Block quotes                  First line
  list         List blocks                   First item
  html_block   HTML blocks                   First tag
  link_ref     Link reference definitions    [label]

Strong Summary Output Format

To maintain consistency with existing Rust/Go output, a one-symbol-per-line format is used:

  12 [section][abcd1234] H2: Markdown AST (lines 12-58) // ## Markdown AST
  20 [codeblock][beef5678] fence:rust title=example (lines 20-31) // ```rust fn main() { ... }
  40 [table][cafe9999] cols: name|type|desc (lines 40-45) // | name | type | desc |
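A formatter producing that line shape can be sketched as follows. The 8-hex-digit hash here uses std's `DefaultHasher` purely for illustration; the article does not specify code-tree's actual hashing scheme:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Format one summary line:
/// `{line} [{kind}][{hash8}] {name} (lines a-b) // {preview}`
pub fn summary_line(line: usize, kind: &str, name: &str, end: usize, preview: &str) -> String {
    // Illustrative short hash over the symbol's identity.
    let mut h = DefaultHasher::new();
    (kind, name, line).hash(&mut h);
    let short = format!("{:08x}", h.finish() as u32);
    format!(
        "{:>4} [{}][{}] {} (lines {}-{}) // {}",
        line, kind, short, name, line, end, preview
    )
}
```

The right-aligned line number keeps multi-digit entries visually aligned in the summary output.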

Test Design

Hugo Template Zoo

HTML template testing uses a comprehensive “template zoo” file covering Hugo’s major features:

  {{ define "main" }}
  {{ $p := . }}
  {{ $site := .Site }}
  {{ $title := cond (ne .Title "") .Title $site.Title }}
  <!-- variables, pipelines, conditionals, loops, partials... -->
  {{ end }}

  {{ define "inline-snippet" }}
  {{ $p := .p }}
  <div>{{ printf "Inline: %s" $p.Title }}</div>
  {{ end }}

This test file verifies:

  • define blocks are detected as strong symbols
  • Variable references ($title, .Site) are classified as weak symbols
  • Control flow detection (if/range/with)
  • Pipeline syntax handling (| upper | printf)
  • Recognition of partial/partialCached/template invocations

Integration Test Framework

Adding new language support requires verification through two test functions:

  #[test]
  fn detect_lang_covers_main_extensions() {
      assert_eq!(
          code_tree::detect_lang_from_rel_path("a.html"),
          Some(Lang::Html)
      );
  }

  #[test]
  fn scan_file_covers_languages() {
      assert_scan_non_empty(
          Lang::Html,
          "testdata/hugo.html",
          r#"{{ define "main" }} ... {{ end }}"#,
      );
  }

The assert_scan_non_empty helper writes the given source to a file in a temporary directory, runs the scanner on it, and asserts that the result is non-empty. Together, the two tests cover both language detection and symbol extraction.

Implementation Challenges

Challenge 1: GFM Dialect Differences

tree-sitter-markdown may use different node names between GFM (GitHub Flavored Markdown) and standard CommonMark. In particular, the table node depends on GFM extensions and is not recognized in environments without them.

Solution: Enable GFM extensions in the tree-sitter-md crate and implement heuristic | separator detection as a fallback.
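The fallback heuristic can be sketched as a plain line check. This is illustrative and deliberately loose; real GFM tables have edge cases such as escaped pipes and indentation rules:

```rust
/// Heuristic: a table starts at a `|`-delimited header row followed by a
/// delimiter row made only of `|`, `-`, `:` and whitespace (e.g. `| --- | :-: |`).
pub fn looks_like_table(header: &str, delimiter: &str) -> bool {
    let header_ok = header.trim().starts_with('|') && header.matches('|').count() >= 2;
    let d = delimiter.trim();
    let delim_ok = d.starts_with('|')
        && d.chars().all(|c| matches!(c, '|' | '-' | ':' | ' '))
        && d.contains('-');
    header_ok && delim_ok
}
```

When the parser emits no `table` node, scanning consecutive line pairs with a check like this recovers tables that the non-GFM grammar missed.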

Challenge 2: Automatic Template Dialect Detection

The same .html file may use Hugo templates ({{ }} syntax) or Jinja2 templates ({% %} syntax), requiring different regex patterns.

Solution: Scan file content — if {{ define/{{ block patterns are found, classify as Hugo; if {% extends/{% block patterns are found, classify as Jinja. When both are present, apply both pattern sets.
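A minimal sketch of that content scan, assuming exact `{{ define`-style spellings (the real patterns shown earlier also allow `{{-` trim markers and flexible whitespace):

```rust
#[derive(Debug)]
pub struct Dialects {
    pub hugo: bool,
    pub jinja: bool,
}

/// Detect which template dialect(s) a .html file uses by scanning for
/// dialect-specific block openers. Both flags may be set at once.
pub fn detect_dialects(content: &str) -> Dialects {
    Dialects {
        hugo: content.contains("{{ define") || content.contains("{{ block"),
        jinja: content.contains("{% extends") || content.contains("{% block"),
    }
}
```

Returning independent flags rather than a single enum is what allows the "apply both pattern sets" case when a file mixes syntaxes.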

Challenge 3: Setext Heading Level Detection

Setext headings use underlines of === (H1) and --- (H2) to specify heading level, requiring different detection logic from atx headings (counting # characters).

Solution:

  let level = if node.kind() == "atx_heading" {
      // Level determined by number of # characters
      node.child(0).unwrap().utf8_text(src.as_bytes()).unwrap().len()
  } else {
      // setext: = means H1, - means H2
      if text.contains('=') { 1 } else { 2 }
  };

Key Learnings

1. Combining Regex with tree-sitter

For template languages where a guest language (template syntax) is embedded within a host language (HTML), tree-sitter alone is insufficient. A two-stage approach — parsing HTML structure with tree-sitter and detecting template syntax within text nodes with regex — proved practical.
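The second stage, locating template expressions inside text nodes, can even be sketched without the regex crate using plain string scanning. This is illustrative only; the actual implementation uses the regexes shown earlier, which also handle `{{-` trim markers:

```rust
/// Extract every `{{ ... }}` expression from a chunk of text content,
/// as the second stage after tree-sitter has isolated HTML text nodes.
pub fn template_exprs(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut rest = text;
    while let Some(open) = rest.find("{{") {
        let after = &rest[open..];
        if let Some(close) = after.find("}}") {
            // Keep the delimiters so downstream classification sees them.
            out.push(&after[..close + 2]);
            rest = &after[close + 2..];
        } else {
            break; // unclosed expression: stop scanning
        }
    }
    out
}
```

Running this over `<h1>{{ .Title }}</h1>` yields the single expression `{{ .Title }}`, which the regex layer then classifies as a weak variable reference.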

2. Query-Based Symbol Extraction

Using tree-sitter’s query feature centralizes node type detection into query strings. Compared to manual AST walking, the code becomes more concise and adding new block elements is straightforward.

3. Generality of the Section Concept

The Markdown section generation logic is applicable to other document formats (reStructuredText, AsciiDoc). The pattern of collecting heading positions and then determining section boundaries with a linear scan is broadly reusable.

Results Achieved

  1. HTML Template Support: Symbol extraction from both Hugo Go Template and Jinja2
  2. Markdown Scanner: Comprehensive block element extraction including section generation
  3. System Compatibility: Maintained SymbolRecord format, integrated with strong/weak summaries
  4. Extensibility: Adding new template dialects requires only pattern definition additions