I'm working on a codebase indexing project that requires me to split source code into chunks. Tree-sitter [1] turned out to be the perfect tool for this.
What is tree-sitter?
Tree-sitter is a multi-language incremental parsing library and a parser generator tool. It combines three key concepts:
- parser generator tool [2]: takes grammar rules and generates the source code of a parser
- incremental parsing library: efficiently reparses code by reusing previous results
- concrete syntax tree (CST): preserves all tokens, including whitespace and comments
How does it work?
Tree-sitter takes your source code and:
- tokenizes it into tokens
- builds a tree where each node has a type defined by the language grammar (e.g., `module`, `function_definition`, `expression_statement`)
- allows you to traverse the tree to extract or analyze code structure
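The traversal step can be sketched without the library. In the snippet below, `Node` is a hypothetical stand-in for the nodes a parser returns (it only carries a `type` and a list of `children`), and the tree is built by hand rather than parsed, so treat it as an illustration of the walk and not of the tree-sitter API itself.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for a syntax node: a type plus child nodes."""
    type: str
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


# Hand-built tree for a module that contains a single function.
tree = Node("module", [
    Node("function_definition", [
        Node("identifier"),
        Node("parameters", [Node("identifier")]),
        Node("block", [Node("expression_statement", [Node("call")])]),
    ]),
])

# Collect every node type in visiting order, then keep only the functions.
node_types = [n.type for n in walk(tree)]
functions = [n for n in walk(tree) if n.type == "function_definition"]
print(node_types)
print(len(functions))
```

Filtering the walk by node type is the core of chunk extraction: every `function_definition` found this way is a candidate chunk.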
Why incremental?
The incremental parsing feature is what makes tree-sitter special:
- reuses the previous parse tree, so it doesn't need to reparse the entire file
- tracks which byte ranges changed and updates only the affected parts of the tree
- makes parsing fast enough for real-time editor features
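To make "tracks byte changes" concrete, here is a minimal sketch of how an edit can be described as byte offsets. It does not call tree-sitter. `ByteEdit` and `describe_edit` are hypothetical helpers, and the field names are chosen to mirror the `start_byte`, `old_end_byte`, and `new_end_byte` values that tree-sitter's edit description expects (the row/column points it also takes are left out). An editor normally knows the edit directly; diffing two versions is only done here for illustration.

```python
from typing import NamedTuple


class ByteEdit(NamedTuple):
    """Hypothetical record of one edit, expressed in byte offsets."""
    start_byte: int    # first byte that differs
    old_end_byte: int  # end of the replaced span in the old source
    new_end_byte: int  # end of the replacement span in the new source


def describe_edit(old: bytes, new: bytes) -> ByteEdit:
    """Find the smallest single span that turns `old` into `new`."""
    limit = min(len(old), len(new))

    # Length of the common prefix.
    start = 0
    while start < limit and old[start] == new[start]:
        start += 1

    # Length of the common suffix, never overlapping the prefix.
    suffix = 0
    while suffix < limit - start and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1

    return ByteEdit(start, len(old) - suffix, len(new) - suffix)


old_src = b"def hello(name):\n    print(name)\n"
new_src = b"def hello(name):\n    print(name.upper())\n"

edit = describe_edit(old_src, new_src)
print(edit)
print(new_src[edit.start_byte:edit.new_end_byte])
```

Everything outside the reported span is byte-for-byte identical, which is why the parts of the old tree that cover it can be reused.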
Example: parsing Python code
Let's see what tree-sitter output actually looks like. Here's a simple Python function:
```python
def hello(name):
         print(f"Hello {name}")
```

And here's the tree that tree-sitter generates:
```
module [0, 0] - [2, 0]
  function_definition [0, 0] - [1, 31]
    name: identifier [0, 4] - [0, 9]
    parameters: parameters [0, 9] - [0, 15]
      identifier [0, 10] - [0, 14]
    body: block [1, 9] - [1, 31]
      expression_statement [1, 9] - [1, 31]
        call [1, 9] - [1, 31]
          function: identifier [1, 9] - [1, 14]
          arguments: argument_list [1, 14] - [1, 31]
            string [1, 15] - [1, 30]
              string_start [1, 15] - [1, 17]
              string_content [1, 17] - [1, 23]
              interpolation [1, 23] - [1, 29]
                expression: identifier [1, 24] - [1, 28]
              string_end [1, 29] - [1, 30]
```

Each line shows:
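- node type: what kind of syntax element it is
- position: zero-based `[row, column]` ranges showing where it appears in the source
- field name (on some lines): a prefix such as `name:` or `body:` that labels the role the node plays in its parent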
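Those position ranges are enough to cut a chunk back out of the source. The sketch below uses a hypothetical helper, `slice_by_points`, with the ranges copied from the tree above. It assumes ASCII source, so that a column is the same as a character index, and it assumes the body line is indented by nine spaces, as the column numbers in the tree imply.

```python
from typing import Tuple

Point = Tuple[int, int]  # (row, column), both zero-based

# Assumption: nine spaces of indentation, matching column 9 in the tree.
SOURCE = 'def hello(name):\n         print(f"Hello {name}")\n'


def slice_by_points(source: str, start: Point, end: Point) -> str:
    """Return the text between two (row, column) points; end is exclusive."""
    lines = source.splitlines(keepends=True)
    (start_row, start_col), (end_row, end_col) = start, end

    if start_row == end_row:
        return lines[start_row][start_col:end_col]

    parts = [lines[start_row][start_col:]]       # rest of the first row
    parts.extend(lines[start_row + 1:end_row])   # full rows in between
    if end_row < len(lines):                     # partial last row, if any
        parts.append(lines[end_row][:end_col])
    return "".join(parts)


function_name = slice_by_points(SOURCE, (0, 4), (0, 9))
callee = slice_by_points(SOURCE, (1, 9), (1, 14))
function_chunk = slice_by_points(SOURCE, (0, 0), (1, 31))
print(function_name)
print(callee)
print(function_chunk)
```

For a chunking pipeline, the `function_definition` range is the interesting one: it yields the whole function as a single chunk.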