Hacker’s guide to the NodePattern compiler

This documentation is aimed at anyone wanting to understand / modify the NodePattern compiler. It assumes some familiarity with the syntax of NodePattern, as well as the AST produced by the parser gem.

Use the node pattern debugger if you want to quickly try things out.

High level view

The NodePattern compiler uses the same techniques as the parser gem:

  • a Lexer that breaks source into tokens

  • a Parser that uses tokens and a Builder to emit an AST

  • a Compiler that converts this AST into Ruby code

Example:

  • Pattern: (send nil? {:puts :p} $...)

  • Tokens: '(', [:tNODE_TYPE, :send], [:tPREDICATE, :nil?], '{', ...

  • AST: s(:sequence, s(:node_type, :send), s(:predicate, :nil?), s(:union, ...

  • Ruby code:

    node.is_a?(::RuboCop::AST::Node) && node.children.size >= 2 &&
    node.send_type? &&
    node.children[0].nil?() &&
    (union2 = node.children[1]; ...

The different parts are described below

Vocabulary

"node pattern": something that can be matched against a single AST::Node. While (int 42) and #is_fun? both correspond to node patterns, ... (without the parenthesis) is not a node pattern.

"sequence": a node pattern that describes the sequence of children of a node (and its type): (type first_child second_child ...)

"variadic": element of a sequence that can match a variable number of children. (send _ int* ...) has two variadic elements (int* and ...). (send _ :name) contains no variadic element. Note that a sequence is itself never variadic.

"atom": element of a pattern that corresponds with a simple Ruby object. (send nil? :puts (str 'hello')) has two atoms: :puts and 'hello'.

Lexer

The lexer.rb defines Lexer and has the few definitions needed for the lexer to work. The bulk of the processing is in the inherited class that is generated by oedipus_lex

Rules

oedipus_lex generates the Ruby file lexer.rex.rb from the rules defined in lexer.rex.

These rules map a Regexp to code that emits a token.

oedipus_lex aims to be simple and the generated file is readable. It uses StringScanner behind the scene. It selects the first rule that matches, contrary to many lexing tools that prioritize longest match.

Tokens

The Lexer emits tokens with types that are:

  • string for the syntactic symbols (e.g. '(', '$', '...')

  • symbols of the form :tTOKEN_TYPE for the rest (e.g. :tPREDICATE)

Tokens are stored as [type, value], or [type, [value, location]] if locations are emitted.

Generation

Use rake generate:lexer to generate the lexer.rex.rb from lexer.rex file. This is done automatically by rake spec.

the lexer.rex.rb is not under source control, but is included in the gem.

Parser

Similarly to the Lexer, the parser.rb defines Parser and has the few definitions needed for the parser to work. The bulk of the processing is in the inherited class parser.racc.rb that is generated by racc from the rules in parser.y.

Nodes

The Parser emits NodePattern::Node which are similar to RuboCop’s node. They both inherit from parser's Parser::AST::Source::Node, and share additional methods too.

Like for RuboCop’s nodes, some nodes have specicialized classes (e.g. Sequence) while other nodes use the base class directly (e.g. s(:number, 42))

Rules

The rules follow closely the definitions above. In particular a distinction between node_pattern_list, which is a list of node patterns (each term can match a single node), while the more generic variadic_pattern_list is a list of elements, some of which could be variadic, others simple node patterns.

Generation

Similarly to the lexer, use rake generate:parser to generate the parser.racc.rb from parser.y file. This is done automatically by rake spec.

the parser.racc.rb is not under source control, but is included in the gem.

Compiler

The compiler’s core is the Compiler class. It holds the global state (e.g. references to named arguments). The goal of the compiler is to produce matching_code, Ruby code that can be run against an AST::Node, or any Ruby object for that matter.

Packaging of that matching_code into code for a lambda, or method def is handled separately by the MethodDefiner module.

The compilation itself is handled by three subcompilers:

  • NodePatternSubcompiler

  • AtomSubcompiler

  • SequenceSubcompiler

Visitors

The subcompilers use the visitor pattern.

The methods starting with visit_ are used to process the different types of nodes. For a node of type :capture, the method visit_capture will be called, or if none is defined then visit_other_type will be called.

No argument is passed, as the visited node is accessible with the node attribute reader.

NodePatternSubcompiler

Given any NodePattern::Node, it generates the Ruby code that can return true or false for the given node, or node type for sequence head.

var vs access

The subcompiler can be called with the current node stored either in a variable (provided with the var: keyword argument) or via a Ruby expression (e.g. access: 'current_node.children[2]').

The subcompiler will not generate code that executes this access expression more than once or twice. If it might access the node more than that, multiple_access will store the result in a temporary variable (e.g. union).

Sequences

Sequences are the most difficult elements to handle and are deferred to the SequenceSubcompiler.

Atoms

Atoms are handled with visit_other_type, which defers to the AtomSubcompiler and converts that result to a node pattern by appending === cur_node (or === cur_node.type if in sequence head).

This way, the two arguments in (_ #func?(%1) %2) would be compiled differently; %1 would be compiled as param1, while %2 gets compiled as param2 === node.children[1].

Precedence

The code generated has higher or equal precedence to &&, so as to make chaining convenient.

AtomSubcompiler

This subcompiler produces Ruby code that gets evaluated to a Ruby object. E.g. "42", :a_symbol, param1.

A good way to think about it is when it has to be passed as arguments to a function call. For example:

# Pattern '#func(42, %1)' compiles to
func(node, 42, param1)

Note that any node pattern can be output by this subcompiler, but those that don’t correspond to a Ruby literal will be output as a lambda so they can be combined. For example:

# Pattern '#func(int)' compiles to
func(node, ->(compare) { compare.is_a?(::RuboCop::AST::Node) && compare.int_type? })

SequenceSubcompiler

The subcompiler compiles the sequences' terms in turn, keeping track of which children of the AST::Node are being matched.

Variadic terms

The complexity comes from variadic elements, which have complex processing and may make it impossible to know at compile time which children are matched by the subsequent terms.

Example (no variadic terms)

(_type int _ str)

First child must match int, third child must match str. The subcompiler will use children[0] and children[2].

Example (one variadic terms)

(_type int _* str)

First child must match int and last child must match str. The subcompiler will use children[0] and children[-1].

Example (multiple variadic terms)

(_type int+ sym str+)

The subcompiler can not use any integer and children[] to match sym. This must be tracked at runtime in a variable (cur_index).

The subcompiler will use fixed indices before the first variadic element and after the last one.

Node pattern terms

The node pattern terms are delegated to the NodePatternSubcompiler.

In the pattern (:sym :sym), both :sym will be compiled differently because the first :sym is in "sequence head": :sym === node.type and :sym == node.children[0] respectively. The subcompiler indicates if the pattern is in "sequence head" or not, so the NodePatternSubcompiler can produce the right code.

Variadic elements may not (currently) cover the sequence head. As a convenience, (...) is understood as (_ ...). Other types of nodes will raise an error (e.g. (<will not compile>); see Node#in_sequence_head)

Precedence

Like the node pattern subcompiler, it generates code that has higher or equal precedence to &&, so as to make chaining convenient.

Variant: WithMeta

These variants of the Parser / Builder / Lexer generate location information (exactly like the parser gem) for AST nodes as well as comments with their locations (like the parser gem).

Since this information is not typically used when one only wants to define methods, it is not loaded by default.

Variant: Debug

These variants of the Compiler / Subcompilers works by adding tracing code before and after each compilation of NodePatternSubcompiler and SequenceSubcompiler. A unique ID is assigned to each node and the tracing code flips a corresponding switch when the expression is about to be evaluated, and after (joined with && so it only flips the switch if the node was a match). Atoms are not compiled differently as they are not really matchable (when not compiled as a node pattern)