Hacker’s guide to the NodePattern
compiler
This documentation is aimed at anyone wanting to understand / modify the NodePattern
compiler.
It assumes some familiarity with the syntax of NodePattern
, as well as the AST produced by the parser
gem.
High level view
The NodePattern
compiler uses the same techniques as the parser
gem:
-
a
Lexer
that breaks source into tokens -
a
Parser
that uses tokens and aBuilder
to emit an AST -
a
Compiler
that converts this AST into Ruby code
Example:
-
Pattern:
(send nil? {:puts :p} $...)
-
Tokens:
'(', [:tNODE_TYPE, :send], [:tPREDICATE, :nil?], '{', ...
-
AST:
s(:sequence, s(:node_type, :send), s(:predicate, :nil?), s(:union, ...
-
Ruby code:
node.is_a?(::RuboCop::AST::Node) && node.children.size >= 2 && node.send_type? && node.children[0].nil?() && (union2 = node.children[1]; ...
The different parts are described below
Vocabulary
"node pattern": something that can be matched against a single AST::Node
.
While (int 42)
and #is_fun?
both correspond to node patterns, ...
(without the parenthesis) is not a node pattern.
"sequence": a node pattern that describes the sequence of children of a node (and its type): (type first_child second_child ...)
"variadic": element of a sequence that can match a variable number of children.
(send _ int* ...)
has two variadic elements (int*
and ...
).
(send _ :name)
contains no variadic element.
Note that a sequence is itself never variadic.
"atom": element of a pattern that corresponds with a simple Ruby object.
(send nil?
:puts (str 'hello'))
has two atoms: :puts
and 'hello'
.
Lexer
The lexer.rb
defines Lexer
and has the few definitions needed for the lexer to work.
The bulk of the processing is in the inherited class that is generated by oedipus_lex
Rules
oedipus_lex
generates the Ruby file lexer.rex.rb
from the rules defined in lexer.rex
.
These rules map a Regexp to code that emits a token.
oedipus_lex
aims to be simple and the generated file is readable.
It uses StringScanner
behind the scene.
It selects the first rule that matches, contrary to many lexing tools that prioritize longest match.
Tokens
The Lexer
emits tokens with types that are:
-
string for the syntactic symbols (e.g.
'('
,'$'
,'...'
) -
symbols of the form
:tTOKEN_TYPE
for the rest (e.g.:tPREDICATE
)
Tokens are stored as [type, value]
, or [type, [value, location]]
if locations are emitted.
Generation
Use rake generate:lexer
to generate the lexer.rex.rb
from lexer.rex
file.
This is done automatically by rake spec
.
the lexer.rex.rb is not under source control, but is included in the gem.
|
Parser
Similarly to the Lexer
, the parser.rb
defines Parser
and has the few definitions needed for the parser to work.
The bulk of the processing is in the inherited class parser.racc.rb
that is generated by racc
from the rules in parser.y
.
Nodes
The Parser
emits NodePattern::Node
which are similar to RuboCop’s node.
They both inherit from parser
's Parser::AST::Source::Node
, and share additional methods too.
Like for RuboCop’s nodes, some nodes have specicialized classes (e.g.
Sequence
) while other nodes use the base class directly (e.g.
s(:number, 42)
)
Rules
The rules follow closely the definitions above.
In particular a distinction between node_pattern_list
, which is a list of node patterns (each term can match a single node), while the more generic variadic_pattern_list
is a list of elements, some of which could be variadic, others simple node patterns.
Generation
Similarly to the lexer, use rake generate:parser
to generate the parser.racc.rb
from parser.y
file.
This is done automatically by rake spec
.
the parser.racc.rb is not under source control, but is included in the gem.
|
Compiler
The compiler’s core is the Compiler
class.
It holds the global state (e.g.
references to named arguments).
The goal of the compiler is to produce matching_code
, Ruby code that can be run against an AST::Node
, or any Ruby object for that matter.
Packaging of that matching_code
into code for a lambda
, or method def
is handled separately by the MethodDefiner
module.
The compilation itself is handled by three subcompilers:
-
NodePatternSubcompiler
-
AtomSubcompiler
-
SequenceSubcompiler
Visitors
The subcompilers use the visitor pattern.
The methods starting with visit_
are used to process the different types of nodes.
For a node of type :capture
, the method visit_capture
will be called, or if none is defined then visit_other_type
will be called.
No argument is passed, as the visited node is accessible with the node
attribute reader.
NodePatternSubcompiler
Given any NodePattern::Node
, it generates the Ruby code that can return true
or false
for the given node, or node type for sequence head.
var
vs access
The subcompiler can be called with the current node stored either in a variable (provided with the var:
keyword argument) or via a Ruby expression (e.g.
access: 'current_node.children[2]'
).
The subcompiler will not generate code that executes this access
expression more than once or twice.
If it might access the node more than that, multiple_access
will store the result in a temporary variable (e.g.
union
).
Sequences
Sequences are the most difficult elements to handle and are deferred to the SequenceSubcompiler
.
Atoms
Atoms are handled with visit_other_type
, which defers to the AtomSubcompiler
and converts that result to a node pattern by appending === cur_node
(or === cur_node.type
if in sequence head).
This way, the two arguments in (_ #func?(%1) %2)
would be compiled differently;
%1
would be compiled as param1
, while %2
gets compiled as param2 === node.children[1]
.
AtomSubcompiler
This subcompiler produces Ruby code that gets evaluated to a Ruby object.
E.g.
"42"
, :a_symbol
, param1
.
A good way to think about it is when it has to be passed as arguments to a function call. For example:
# Pattern '#func(42, %1)' compiles to
func(node, 42, param1)
Note that any node pattern can be output by this subcompiler, but those that don’t correspond to a Ruby literal will be output as a lambda so they can be combined. For example:
# Pattern '#func(int)' compiles to
func(node, ->(compare) { compare.is_a?(::RuboCop::AST::Node) && compare.int_type? })
SequenceSubcompiler
The subcompiler compiles the sequences' terms in turn, keeping track of which children of the AST::Node
are being matched.
Variadic terms
The complexity comes from variadic elements, which have complex processing and may make it impossible to know at compile time which children are matched by the subsequent terms.
Example (no variadic terms)
(_type int _ str)
First child must match int
, third child must match str
.
The subcompiler will use children[0]
and children[2]
.
Example (one variadic terms)
(_type int _* str)
First child must match int
and last child must match str
.
The subcompiler will use children[0]
and children[-1]
.
Example (multiple variadic terms)
(_type int+ sym str+)
The subcompiler can not use any integer and children[]
to match sym
.
This must be tracked at runtime in a variable (cur_index
).
The subcompiler will use fixed indices before the first variadic element and after the last one.
Node pattern terms
The node pattern terms are delegated to the NodePatternSubcompiler
.
In the pattern (:sym :sym)
, both :sym
will be compiled differently because the first :sym
is in "sequence head": :sym === node.type
and :sym == node.children[0]
respectively.
The subcompiler indicates if the pattern is in "sequence head" or not, so the NodePatternSubcompiler
can produce the right code.
Variadic elements may not (currently) cover the sequence head.
As a convenience, (...)
is understood as (_ ...)
.
Other types of nodes will raise an error (e.g.
(<will not compile>)
;
see Node#in_sequence_head
)
Variant: WithMeta
These variants of the Parser / Builder / Lexer generate location
information (exactly like the parser
gem) for AST nodes as well as comments with their locations (like the parser
gem).
Since this information is not typically used when one only wants to define methods, it is not loaded by default.
Variant: Debug
These variants of the Compiler / Subcompilers works by adding tracing code before and after each compilation of NodePatternSubcompiler
and SequenceSubcompiler
.
A unique ID is assigned to each node and the tracing code flips a corresponding switch when the expression is about to be evaluated, and after (joined with &&
so it only flips the switch if the node was a match).
Atoms are not compiled differently as they are not really matchable (when not compiled as a node pattern)