Building code-tree HTML Template and Markdown Scanner — Extending to Document Formats
Design and implementation of extending code-tree from programming languages to HTML templates (Hugo/Jinja) and Markdown documents, tree-sitter query-based scanning, and section generation logic
Overview
code-tree was originally developed as a static analysis tool for extracting symbols from programming languages like Rust, Go, Python, and TypeScript. However, real-world projects contain not only source code but also HTML templates (Hugo Go Template, Jinja2) and Markdown documents as essential components. This article documents the design decisions and implementation process of extending code-tree to support these document formats.
Background: Why Support Document Formats
The Problem
code-tree’s initial design targeted source code files (.rs, .go, .py, etc.). However, in Hugo-based web projects and Python web frameworks using Jinja2 templates, template files and Markdown content constitute a significant portion of the project.
Excluding these from symbol extraction meant:
- Template structural changes weren’t reflected in the context window
- Markdown document section structures couldn’t be captured
- LLM agents lacked sufficient information to understand the full project
Design Goals
- Extract scope-defining symbols (define/block/macro) from HTML templates
- Classify template variables and control flow as weak-scope symbols
- Extract section structures and block elements from Markdown as symbols
- Maintain compatibility with the existing SymbolRecord format
HTML Template Symbol Extraction
Regex Design for Hugo/Jinja Templates
HTML templates cannot be fully parsed by tree-sitter’s HTML parser alone. Constructs like {{ define "main" }} or {% block content %} are not recognized as HTML nodes — they are embedded within text content. This necessitated a combination of regex-based pattern matching and tree-sitter AST integration.
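// Assumed imports: Lazy from the once_cell crate, Regex from the regex crate
use once_cell::sync::Lazy;
use regex::Regex;
// Hugo template patterns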
static HUGO_BLOCK_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r"\{\{-?\s*(define|block)\s+").unwrap()
});
static HUGO_CONTROL_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r"\{\{-?\s*(if|else|range|with|end)\b").unwrap()
});
static HUGO_VAR_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r"\{\{-?\s*(\.\w+|\$\w+)").unwrap()
});
// Jinja template patterns
static JINJA_BLOCK_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r"\{%-?\s*(extends|block|macro)\s+").unwrap()
});
Strong/Weak Scope Classification
Template symbol classification applies the same strong/weak concept as programming language function/struct symbols:
Strong scope (structure-defining symbols):
- Hugo: define, block (template inheritance units)
- Jinja: extends, block, macro (layout structure definitions)
Weak scope (referenced/used symbols):
- if/range/with: control flow
- .Title/$site: variable references
- partial: partial template invocations
This classification enables code-tree’s --scope strong option to display only the template skeleton, while --scope weak includes detailed control flow.
Strong Keyword Constants
const HUGO_STRONG_KEYWORDS: &[&str] = &["define", "block"];
const JINJA_STRONG_KEYWORDS: &[&str] = &["extends", "block", "macro"];
Only template blocks matching these keywords are classified as strong symbols. All other control flow and variable references become weak symbols.
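A minimal sketch of how these constants might drive classification. Only the two keyword constants come from the article; the Dialect and Scope enums and the classify_keyword function are hypothetical names used for illustration.

```rust
// Keyword lists as given above.
const HUGO_STRONG_KEYWORDS: &[&str] = &["define", "block"];
const JINJA_STRONG_KEYWORDS: &[&str] = &["extends", "block", "macro"];

// Hypothetical types: the real code-tree representation may differ.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Dialect {
    Hugo,
    Jinja,
}

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Scope {
    Strong,
    Weak,
}

/// Returns Strong when the keyword is in the dialect's strong list,
/// and Weak for every other keyword.
pub fn classify_keyword(dialect: Dialect, keyword: &str) -> Scope {
    let strong = match dialect {
        Dialect::Hugo => HUGO_STRONG_KEYWORDS,
        Dialect::Jinja => JINJA_STRONG_KEYWORDS,
    };
    if strong.contains(&keyword) {
        Scope::Strong
    } else {
        Scope::Weak
    }
}
```

Under this sketch the same keyword can land in different scopes depending on the dialect: macro is strong for Jinja but weak for Hugo.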
Markdown Scanner Design
Tree-sitter Query-Based Approach
The initial implementation used manual tree-sitter AST walking, which led to verbose node type checks and required code modifications in multiple locations when adding new block element types.
The refactored approach leverages tree-sitter’s query capability, detecting all block elements with a single query:
let query = Query::new(
&lang.into(),
r#"
(atx_heading) @heading
(setext_heading) @heading
(fenced_code_block) @code_fence
(block_quote) @blockquote
(list) @list
(table) @table
(html_block) @html_block
(link_reference_definition) @link_ref
"#,
)?;
Benefits of this approach:
- Adding a new block element requires only one line added to the query string
- Pattern matching is delegated to the tree-sitter engine, keeping Rust code simple
- Capture names directly determine symbol type
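The last point can be sketched as a single lookup from capture name to symbol kind. The function name kind_for_capture is hypothetical; the capture names come from the query above, and the code_fence to codeblock mapping follows the kind names used later in this article.

```rust
/// Maps a query capture name to a symbol kind.
/// Hypothetical helper: returns None for captures the scanner does not know.
pub fn kind_for_capture(capture: &str) -> Option<&'static str> {
    match capture {
        "heading" => Some("heading"),
        // The capture is named code_fence, but the symbol kind is codeblock.
        "code_fence" => Some("codeblock"),
        "blockquote" => Some("blockquote"),
        "list" => Some("list"),
        "table" => Some("table"),
        "html_block" => Some("html_block"),
        "link_ref" => Some("link_ref"),
        _ => None,
    }
}
```

With a lookup like this, supporting a new block element would mean one new query line and one new match arm.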
Section Generation Logic
A key characteristic of Markdown documents is that headings define section boundaries. code-tree generates section symbols with the following logic:
- Collect all heading positions and levels
- Treat content before the first heading as a (preamble) section
- Define section range from each heading to the next heading (or end of document)
- Generate section names in H{level}: {heading_text} format
struct Heading {
level: usize,
line_no: usize,
text: String,
}
// Section range calculation
for (i, heading) in headings.iter().enumerate() {
let start_line = heading.line_no;
let end_line = if i + 1 < headings.len() {
headings[i + 1].line_no - 1
} else {
lines.len()
};
let name = format!("H{}: {}", heading.level, heading.text);
// Register as section symbol
}
Section range calculation runs in O(n) in the number of headings. All heading positions are collected first, and section boundaries are then determined in a single linear pass. Because each section simply ends where the next heading begins, regardless of level, nested heading structures add no extra cost.
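The full procedure, including the (preamble) section, can be sketched as follows. This is an illustrative version, not code-tree's actual implementation: the Section struct and build_sections function are hypothetical, and line numbers are assumed to be 1-based with headings sorted by position.

```rust
#[derive(Debug, Clone)]
pub struct Heading {
    pub level: usize,
    pub line_no: usize, // assumed 1-based
    pub text: String,
}

// Hypothetical output type for this sketch.
#[derive(Debug, PartialEq)]
pub struct Section {
    pub name: String,
    pub start_line: usize,
    pub end_line: usize, // inclusive
}

/// Builds sections from headings sorted by line number.
/// `total_lines` is the number of lines in the document.
pub fn build_sections(headings: &[Heading], total_lines: usize) -> Vec<Section> {
    let mut sections = Vec::new();

    // Content before the first heading becomes a "(preamble)" section.
    let first = headings.first().map_or(total_lines + 1, |h| h.line_no);
    if first > 1 {
        sections.push(Section {
            name: "(preamble)".to_string(),
            start_line: 1,
            end_line: first - 1,
        });
    }

    // Each section runs from its heading to the line before the next heading,
    // or to the end of the document for the last heading.
    for (i, h) in headings.iter().enumerate() {
        let end_line = match headings.get(i + 1) {
            Some(next) => next.line_no - 1,
            None => total_lines,
        };
        sections.push(Section {
            name: format!("H{}: {}", h.level, h.text),
            start_line: h.line_no,
            end_line,
        });
    }
    sections
}
```

A document with no headings at all yields a single (preamble) section covering every line, and an empty document yields no sections.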
Code Fence Title Extraction
Code fence titles are determined with the following priority:
- Extract title from a comment on the line immediately following the fence (# title, // title, <!-- title -->)
- If the info string contains title=..., use that value
- If neither applies, default to fence:{lang}
"code_fence" => {
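    // Look up the info string by node kind; this works whether or not
    // the grammar exposes it as a named field.
    let mut cursor = node.walk();
    let info = node
        .children(&mut cursor)
        .find(|n| n.kind() == "info_string")
        .and_then(|n| n.utf8_text(source.as_bytes()).ok())
        .unwrap_or("")
        .trim();
    // Only the fallback naming is shown; the comment and title=... lookups are omitted.
    let lang_label = if info.is_empty() { "plain" } else { info };
    let title = format!("fence:{}", lang_label);
    md_push_symbol(&mut out, rel_path, scope, "codeblock", title, line_no, &text);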
}
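The three-step priority described above might look like the following sketch, written against plain strings so it does not depend on tree-sitter. The function names are hypothetical, and some details are assumptions: comment markers must be followed by a space (so that lines such as #include are not mistaken for titles), title= values are read as a single whitespace-delimited token, and the fallback uses only the first token of the info string as the language.

```rust
/// Extracts a title from a leading comment line such as
/// `# title`, `// title`, or `<!-- title -->`.
fn title_from_comment(line: &str) -> Option<String> {
    let line = line.trim();
    let body = if let Some(rest) = line.strip_prefix("<!--") {
        rest.strip_suffix("-->")?
    } else if let Some(rest) = line.strip_prefix("// ") {
        rest
    } else if let Some(rest) = line.strip_prefix("# ") {
        rest
    } else {
        return None;
    };
    let body = body.trim();
    if body.is_empty() {
        None
    } else {
        Some(body.to_string())
    }
}

/// Extracts the value of a `title=...` token from the info string.
/// Quoted titles containing spaces are not handled in this sketch.
fn title_from_info(info: &str) -> Option<String> {
    info.split_whitespace()
        .find_map(|tok| tok.strip_prefix("title="))
        .map(|t| t.trim_matches('"').to_string())
        .filter(|t| !t.is_empty())
}

/// Applies the priority: comment line, then title=..., then fence:{lang}.
pub fn fence_title(info: &str, first_body_line: Option<&str>) -> String {
    if let Some(t) = first_body_line.and_then(title_from_comment) {
        return t;
    }
    if let Some(t) = title_from_info(info) {
        return t;
    }
    let lang = info.split_whitespace().next().unwrap_or("plain");
    format!("fence:{}", lang)
}
```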
Symbol Kind Design
Markdown symbol kinds were designed for compatibility with the existing kind system:
| kind | Target | name Content |
|---|---|---|
| section | Heading sections | H{level}: {heading_text} |
| heading | Heading lines | Heading text |
| codeblock | Code fences | fence:{lang} + title |
| table | GFM tables | Header row |
| blockquote | Block quotes | First line |
| list | List blocks | First item |
| html_block | HTML blocks | First tag |
| link_ref | Link reference definitions | [label] |
Strong Summary Output Format
To maintain consistency with existing Rust/Go output, a one-symbol-per-line format is used:
12 [section][abcd1234] H2: Markdown AST (lines 12-58) // ## Markdown AST
20 [codeblock][beef5678] fence:rust title=example (lines 20-31) // ```rust fn main() { ... }
40 [table][cafe9999] cols: name|type|desc (lines 40-45) // | name | type | desc |
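A line in this layout could be produced by a small formatter like the one below. This is only a sketch: the SummaryLine struct and format_summary function are hypothetical, and the bracketed identifier (abcd1234 in the sample) is passed in as an opaque string because the article does not say how it is derived.

```rust
// Hypothetical record holding the fields visible in the sample output.
pub struct SummaryLine<'a> {
    pub line_no: usize,
    pub kind: &'a str,
    pub id: &'a str, // opaque short identifier; derivation not specified here
    pub name: &'a str,
    pub end_line: usize,
    pub snippet: &'a str,
}

/// Formats one symbol as a single summary line:
/// `{line} [{kind}][{id}] {name} (lines {start}-{end}) // {snippet}`
pub fn format_summary(s: &SummaryLine<'_>) -> String {
    format!(
        "{} [{}][{}] {} (lines {}-{}) // {}",
        s.line_no, s.kind, s.id, s.name, s.line_no, s.end_line, s.snippet
    )
}
```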
Test Design
Hugo Template Zoo
HTML template testing uses a comprehensive “template zoo” file covering Hugo’s major features:
{{ define "main" }}
{{ $p := . }}
{{ $site := .Site }}
{{ $title := cond (ne .Title "") .Title $site.Title }}
<!-- variables, pipelines, conditionals, loops, partials... -->
{{ end }}
{{ define "inline-snippet" }}
{{ $p := .p }}
<div>{{ printf "Inline: %s" $p.Title }}</div>
{{ end }}
This test file verifies:
- define blocks are detected as strong symbols
- Variable references ($title, .Site) are classified as weak symbols
- Control flow detection (if/range/with)
- Pipeline syntax handling (| upper | printf)
- Recognition of partial/partialCached/template invocations
Integration Test Framework
Adding new language support requires verification through two test functions:
#[test]
fn detect_lang_covers_main_extensions() {
assert_eq!(
code_tree::detect_lang_from_rel_path("a.html"),
Some(Lang::Html)
);
}
#[test]
fn scan_file_covers_languages() {
assert_scan_non_empty(
Lang::Html,
"testdata/hugo.html",
r#"{{ define "main" }} ... {{ end }}"#,
);
}
The assert_scan_non_empty helper creates a source file in a temporary directory and confirms the scan result is non-empty. Together, the two tests cover both language detection and symbol extraction, although a non-empty check is only a smoke test and does not validate the extracted symbols themselves.
Implementation Challenges
Challenge 1: GFM Dialect Differences
tree-sitter-markdown may use different node names between GFM (GitHub Flavored Markdown) and standard CommonMark. In particular, the table node depends on GFM extensions and is not recognized in environments without them.
Solution: Enable GFM extensions in the tree-sitter-md crate and implement heuristic | separator detection as a fallback.
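One possible shape for the heuristic fallback is to look for a GFM delimiter row (cells made of dashes with optional alignment colons) directly below a line that contains a pipe. The sketch below is an assumption about how such a fallback could work, not the actual code-tree implementation; both function names are hypothetical.

```rust
/// Returns true if `line` looks like a GFM table delimiter row,
/// for example `|---|:--:|` or `--- | ---`.
pub fn is_delimiter_row(line: &str) -> bool {
    let trimmed = line.trim();
    if !trimmed.contains('|') || !trimmed.contains('-') {
        return false;
    }
    // Drop at most one leading and one trailing pipe.
    let inner = trimmed.strip_prefix('|').unwrap_or(trimmed);
    let inner = inner.strip_suffix('|').unwrap_or(inner);
    inner.split('|').all(|cell| {
        let c = cell.trim();
        let c = c.strip_prefix(':').unwrap_or(c);
        let c = c.strip_suffix(':').unwrap_or(c);
        !c.is_empty() && c.chars().all(|ch| ch == '-')
    })
}

/// Returns the 1-based line numbers of header rows, that is, lines
/// containing a pipe that are immediately followed by a delimiter row.
pub fn find_table_starts(source: &str) -> Vec<usize> {
    let lines: Vec<&str> = source.lines().collect();
    lines
        .windows(2)
        .enumerate()
        .filter(|(_, w)| w[0].contains('|') && is_delimiter_row(w[1]))
        .map(|(i, _)| i + 1)
        .collect()
}
```

A check like this does not know about code fences, so a real fallback would likely need to skip lines that fall inside fenced code blocks.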
Challenge 2: Automatic Template Dialect Detection
The same .html file may use Hugo templates ({{ }} syntax) or Jinja2 templates ({% %} syntax), requiring different regex patterns.
Solution: Scan file content — if {{ define/{{ block patterns are found, classify as Hugo; if {% extends/{% block patterns are found, classify as Jinja. When both are present, apply both pattern sets.
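A dependency-free sketch of this detection step is shown below. It uses plain string scanning rather than the regex patterns from earlier, and the TemplateDialect enum and both function names are hypothetical. The keyword must be followed by whitespace, mirroring the \s+ in the block patterns, so that something like {{ blockquote }} is not treated as a block tag.

```rust
// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum TemplateDialect {
    Hugo,
    Jinja,
    Both,
    Unknown,
}

/// Returns true if `source` contains `open`, an optional `-` trim marker,
/// optional whitespace, and then one of `keywords` followed by whitespace.
fn has_tag(source: &str, open: &str, keywords: &[&str]) -> bool {
    source.match_indices(open).any(|(idx, _)| {
        let rest = &source[idx + open.len()..];
        let rest = rest.strip_prefix('-').unwrap_or(rest).trim_start();
        keywords.iter().any(|kw| {
            rest.strip_prefix(*kw)
                .map_or(false, |after| after.starts_with(char::is_whitespace))
        })
    })
}

/// Classifies a template by which structural tags it contains.
/// When both kinds are present, the caller would apply both pattern sets.
pub fn detect_dialect(source: &str) -> TemplateDialect {
    let hugo = has_tag(source, "{{", &["define", "block"]);
    let jinja = has_tag(source, "{%", &["extends", "block"]);
    match (hugo, jinja) {
        (true, true) => TemplateDialect::Both,
        (true, false) => TemplateDialect::Hugo,
        (false, true) => TemplateDialect::Jinja,
        (false, false) => TemplateDialect::Unknown,
    }
}
```

A file that only uses variables, such as a partial containing just {{ .Title }}, comes back as Unknown under this sketch, so a real implementation would need a default for that case.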
Challenge 3: Setext Heading Level Detection
Setext headings use underlines of === (H1) and --- (H2) to specify heading level, requiring different detection logic from atx headings (counting # characters).
Solution:
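let level = if node.kind() == "atx_heading" {
    // Level is the number of # characters in the leading marker node
    node.child(0)
        .and_then(|m| m.utf8_text(src.as_bytes()).ok())
        .map_or(1, |m| m.trim().len())
} else {
    // setext: inspect only the underline (the last non-empty line),
    // because the heading text itself may contain '=' or '-'
    let underline = text.lines().rev().find(|l| !l.trim().is_empty()).unwrap_or("");
    // = means H1, - means H2
    if underline.trim_start().starts_with('=') { 1 } else { 2 }
};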
Key Learnings
1. Combining Regex with tree-sitter
For template languages where a guest language (template syntax) is embedded within a host language (HTML), tree-sitter alone is insufficient. A two-stage approach — parsing HTML structure with tree-sitter and detecting template syntax within text nodes with regex — proved practical.
2. Query-Based Symbol Extraction
Using tree-sitter’s query feature centralizes node type detection into query strings. Compared to manual AST walking, the code becomes more concise and adding new block elements is straightforward.
3. Generality of the Section Concept
The Markdown section generation logic is applicable to other document formats (reStructuredText, AsciiDoc). The pattern of collecting heading positions and then determining section boundaries with a linear scan is broadly reusable.
Results Achieved
- HTML Template Support: Symbol extraction from both Hugo Go Template and Jinja2
- Markdown Scanner: Comprehensive block element extraction including section generation
- System Compatibility: Maintained SymbolRecord format, integrated with strong/weak summaries
- Extensibility: Adding new template dialects requires only pattern definition additions

