using tree-sitter to extract code chunks

I'm working on a codebase indexing project that requires me to extract code into chunks. Tree-sitter¹ turned out to be the perfect tool for this.

What is tree-sitter?

Tree-sitter is a multi-language incremental parsing library and a parser generator tool. It combines three key concepts:

parser generator tool²: takes a language's grammar rules and generates the source code of a parser for it
incremental parsing library: efficiently reparses edited code by reusing previous results
concrete syntax tree (CST): keeps every token from the source, including punctuation and comments, along with its exact position

How does it work?

Tree-sitter takes your source code and:

splits it into tokens (e.g., keywords, identifiers, punctuation)
builds a tree where each node has a type defined by the language grammar (e.g., module, function_definition, expression_statement)
allows you to traverse the tree to extract or analyze code structure

Why incremental?

The incremental parsing feature is what makes tree-sitter special:

reuses the previously parsed tree so it doesn't need to reparse the entire file
tracks which byte ranges were edited so only the affected parts of the tree are updated
makes it fast enough for real-time editor features such as syntax highlighting on every keystroke

Example: parsing Python code

Let's see what tree-sitter output actually looks like. Here's a simple Python function:

python

def hello(name):
    print(f"Hello {name}")

And here's the tree that tree-sitter generates:

module [0, 0] - [2, 0]
  function_definition [0, 0] - [1, 31]
    name: identifier [0, 4] - [0, 9]
    parameters: parameters [0, 9] - [0, 15]
      identifier [0, 10] - [0, 14]
    body: block [1, 9] - [1, 31]
      expression_statement [1, 9] - [1, 31]
        call [1, 9] - [1, 31]
          function: identifier [1, 9] - [1, 14]
          arguments: argument_list [1, 14] - [1, 31]
            string [1, 15] - [1, 30]
              string_start [1, 15] - [1, 17]
              string_content [1, 17] - [1, 23]
              interpolation [1, 23] - [1, 29]
                expression: identifier [1, 24] - [1, 28]
              string_end [1, 29] - [1, 30]

Each line shows:

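field name (optional) - the role the node plays in its parent, such as name: or body:
node type - what kind of syntax element it is
position - zero-based [row, column] start and end points showing where it appears in the source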
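To make the position ranges concrete, here is a small pure-Python sketch that maps a pair of [row, column] points back to source text. `text_at` is a hypothetical helper of mine, not part of tree-sitter. It treats the end point as exclusive and assumes ASCII source, since tree-sitter columns count bytes rather than characters.

```python
def text_at(source: str, start: tuple[int, int], end: tuple[int, int]) -> str:
    """Return the text between two (row, column) points; the end is exclusive."""
    lines = source.splitlines(keepends=True)
    (start_row, start_col), (end_row, end_col) = start, end
    if start_row == end_row:
        return lines[start_row][start_col:end_col]
    middle = lines[start_row + 1:end_row]
    # The end row can point just past the last line, as with a module node.
    last = lines[end_row][:end_col] if end_row < len(lines) else ""
    return lines[start_row][start_col:] + "".join(middle) + last


SOURCE = 'def hello(name):\n    print(f"Hello {name}")\n'

name = text_at(SOURCE, (0, 4), (0, 9))      # the name: identifier node
params = text_at(SOURCE, (0, 9), (0, 15))   # the parameters node
module = text_at(SOURCE, (0, 0), (2, 0))    # the module node spans the whole file
print(name, params)
```

In real code the node's `start_byte` and `end_byte` are simpler to slice with; the row and column form is mostly useful for showing locations to people.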

Footnotes