Integrate.io Sync performs heterogeneous database replication between MySQL and Amazon Redshift (more databases to come!). Implementing such an application requires translating one SQL dialect into another, which is serious programming-language parsing business. We use Ruby's "treetop" to parse SQL statements. Treetop is a PEG parser generator for Ruby. It's a hidden gem that makes writing a parser nice and easy. However, it has its own quirks, which can waste many hours of your time if you don't know about them. This article walks you through some of those gotchas in the hope of saving you precious time and energy.
Simple and Beautiful
Let's start with a simple example.
integer_as_string.treetop
    grammar IntegerValue do
      rule top
        integer {
          def value
            text_value
          end
        }
      end

      rule integer
        negative_integer / non_negative_integer
      end

      rule negative_integer
        '-' number
      end

      rule non_negative_integer
        number
      end

      rule number
        [0-9]+
      end
    end
This defines a parser for integer values, and I believe it is self-explanatory. The top node takes one and only one integer and returns its text_value as value. I used the following Ruby script to test the parser.
integer.rb
    require 'treetop'

    Treetop.load ARGV[0]

    parser = IntegerValueParser.new

    result = parser.parse "-255"
    puts "nodes: "
    p result
    print "value: "
    p result.value

    result = parser.parse "255"
    puts "nodes: "
    p result
    print "value: "
    p result.value
This script loads the grammar file given as the command-line argument, parses the two strings "-255" and "255" with the generated parser, and prints the results. For each string, the script also prints the Treetop parse tree. The result for integer_as_string.treetop looks like this:
    $ ruby integer.rb integer_as_string.treetop
    nodes:
    SyntaxNode+Top0+NegativeInteger0 offset=0, "-255" (value,number):
      SyntaxNode offset=0, "-"
      SyntaxNode offset=1, "255":
        SyntaxNode offset=1, "2"
        SyntaxNode offset=2, "5"
        SyntaxNode offset=3, "5"
    value: "-255"
    nodes:
    SyntaxNode+Top0 offset=0, "255" (value):
      SyntaxNode offset=0, "2"
      SyntaxNode offset=1, "5"
      SyntaxNode offset=2, "5"
    value: "255"
    $
You can see that the values were parsed successfully. Nice, simple, and beautiful. Ignore the node structure for now.
Some Things Can Go Awkward
It makes little sense to handle an integer as a string, so let's change the parser to return an integer value. First, add a custom method intval to the rule number:
    rule number
      [0-9]+ {
        def intval
          text_value.to_i
        end
      }
    end
Second, add an intval method to the rules negative_integer and non_negative_integer. Each method returns the integer value of its number; negative_integer negates it, of course.
    rule negative_integer
      '-' number {
        def intval
          -number.intval
        end
      }
    end

    rule non_negative_integer
      number {
        def intval
          number.intval
        end
      }
    end
Lastly, change the top rule to return intval instead of text_value:
    rule top
      integer {
        def value
          intval
        end
      }
    end
That's it. Let's run the script.
    $ ruby integer.rb integer_as_integer1.treetop
    nodes:
    SyntaxNode+Top0+NegativeInteger1+NegativeInteger0 offset=0, "-255" (value,intval,number):
      SyntaxNode offset=0, "-"
      SyntaxNode+Number0 offset=1, "255" (intval):
        SyntaxNode offset=1, "2"
        SyntaxNode offset=2, "5"
        SyntaxNode offset=3, "5"
    value: -255
    nodes:
    SyntaxNode+Top0+NonNegativeInteger0+Number0 offset=0, "255" (value,intval):
      SyntaxNode offset=0, "2"
      SyntaxNode offset=1, "5"
      SyntaxNode offset=2, "5"
    value: (eval):117:in `intval': undefined local variable or method `number' for # (NameError)
      from (eval):10:in `value'
      from integer.rb:18:in `'
Oops. The parser failed on the positive number even though it parsed the negative number successfully. The error says `intval': undefined local variable or method `number', but non_negative_integer's intval method is defined on a rule that references number! Also, if non_negative_integer throws an error, why doesn't negative_integer cause the same error?!
    rule negative_integer
      '-' number {
        def intval
          -number.intval
        end
      }
    end

    rule non_negative_integer
      number {
        def intval
          number.intval
        end
      }
    end
Instead of diving deep into the mystery right away (we will do that shortly), note that there is a simple way around this error. Since non_negative_integer uses the integer value of number as is, we don't need to define a custom method on that rule at all. The following grammar, with that change applied, works like a charm.
integer_as_integer.treetop
    grammar IntegerValue do
      rule top
        integer {
          def value
            intval
          end
        }
      end

      rule integer
        negative_integer / non_negative_integer
      end

      rule negative_integer
        '-' number {
          def intval
            -number.intval
          end
        }
      end

      rule non_negative_integer
        number
      end

      rule number
        [0-9]+ {
          def intval
            text_value.to_i
          end
        }
      end
    end
Execution result
    $ ruby integer.rb integer_as_integer.treetop
    nodes:
    SyntaxNode+Top0+NegativeInteger1+NegativeInteger0 offset=0, "-255" (value,intval,number):
      SyntaxNode offset=0, "-"
      SyntaxNode+Number0 offset=1, "255" (intval):
        SyntaxNode offset=1, "2"
        SyntaxNode offset=2, "5"
        SyntaxNode offset=3, "5"
    value: -255
    nodes:
    SyntaxNode+Top0+Number0 offset=0, "255" (value,intval):
      SyntaxNode offset=0, "2"
      SyntaxNode offset=1, "5"
      SyntaxNode offset=2, "5"
    value: 255
Things Get Nasty
Now, modify the parser to return a hex string instead of an integer. This time, we have to define a custom method on the rule non_negative_integer, the one that gave us trouble earlier. Let's define a new method hexval on both rules anyway.
    rule negative_integer
      '-' number {
        def intval
          -number.intval
        end

        def hexval
          '%x' % -number.intval
        end
      }
    end

    rule non_negative_integer
      number {
        def hexval
          '%x' % number.intval
        end
      }
    end
Also, don't forget to change the top rule to return hexval.
    rule top
      integer {
        def value
          hexval
        end
      }
    end
Let's run the script. As half expected, it failed with an error similar to the one we saw earlier.
    $ ruby integer.rb integer_as_hex1.treetop
    nodes:
    SyntaxNode+Top0+NegativeInteger1+NegativeInteger0 offset=0, "-255" (value,intval,hexval,number):
      SyntaxNode offset=0, "-"
      SyntaxNode+Number0 offset=1, "255" (intval):
        SyntaxNode offset=1, "2"
        SyntaxNode offset=2, "5"
        SyntaxNode offset=3, "5"
    value: "..f01"
    nodes:
    SyntaxNode+Top0+NonNegativeInteger0+Number0 offset=0, "255" (value,hexval,intval):
      SyntaxNode offset=0, "2"
      SyntaxNode offset=1, "5"
      SyntaxNode offset=2, "5"
    value: (eval):120:in `hexval': undefined local variable or method `number' for # (NameError)
      from (eval):10:in `value'
      from integer.rb:18:in `'
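A side note on the "..f01" printed for -255: that is how Ruby's %x format renders a negative integer. It prints the two's complement bit pattern, with ".." standing for an endless run of leading f digits. If you would rather get a signed hex string, here is a minimal sketch in plain Ruby (no Treetop involved; signed_hex is a hypothetical helper name of mine, not part of the grammar):

```ruby
# Plain Ruby, no Treetop required.
# signed_hex is a hypothetical helper, not part of the article's grammar.
def signed_hex(n)
  # Integer#to_s(16) keeps the minus sign instead of using two's complement.
  n.to_s(16)
end

puts('%x' % 255)      # ff
puts('%x' % -255)     # ..f01 (two's complement notation)
puts signed_hex(255)  # ff
puts signed_hex(-255) # -ff
```

The grammar in this article keeps '%x' as is, so the outputs here continue to show "..f01" for the negative case.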
Node Merge
This time there is no way around it, so let's tackle the issue head-on. First, look at the node trees. In particular, compare the top-level nodes of the two trees, one for the negative input and one for the non-negative input.
    SyntaxNode+Top0+NegativeInteger1+NegativeInteger0 offset=0, "-255" (value,intval,hexval,number):
    SyntaxNode+Top0+NonNegativeInteger0+Number0 offset=0, "255" (value,hexval,intval):
You'll notice that the first line ends its module list with NegativeInteger0 for negative_integer, while the second ends with Number0 for non_negative_integer. Also note that the former has number in its method list while the latter does not. In fact, Treetop "merges" a parent node with its child node if the parent has only one child. The merged node has the children of the child node and the custom methods defined in both nodes. In the case of negative_integer, the merge starts from top and stops at negative_integer because it has multiple children ('-' and number). The merged node has the child nodes of negative_integer ('-' and number) and the methods defined in all merged nodes (value, intval, hexval). On the other hand, non_negative_integer has only one child, number, so the merge continues down to the number node. The merged node has the children of number and the methods defined in all merged nodes (value, intval, hexval), but no number method, because the merged node is the number node itself. This is why the error happened.
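One way to picture the merge is a single Ruby object that gets extended with one module per rule it passed through. The sketch below is a plain-Ruby simulation of that idea, a rough mental model rather than Treetop's actual generated code. FlatNode, NumberMethods, NonNegativeIntegerMerged, and merged_node are all hypothetical names I made up for illustration.

```ruby
# Plain-Ruby simulation; Treetop is not required.
# Every name here is a hypothetical stand-in, not Treetop's real code.

# Stand-in for a syntax node: it only carries the matched text.
FlatNode = Struct.new(:text_value)

# Stand-in for the inline block of rule `number`.
module NumberMethods
  def intval
    text_value.to_i
  end
end

# Stand-in for the inline block of rule `non_negative_integer`.
module NonNegativeIntegerMerged
  def intval
    number.intval # fails: the merged node has no `number` accessor
  end
end

# Build ONE node and extend it twice, the way a merged node collects
# the methods of every single-child rule it passed through.
def merged_node(text)
  node = FlatNode.new(text)
  node.extend(NumberMethods)            # methods from rule `number`
  node.extend(NonNegativeIntegerMerged) # extended later, so its intval wins
  node
end

begin
  merged_node("255").intval
rescue NameError => e
  puts "#{e.class}: #{e.message}"
end
```

The later extend shadows the intval from NumberMethods, and the shadowing method asks for a number accessor that a single merged object never received. That mirrors the NameError in the trace above.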
Solution
You can prevent node merge with a special syntax. Below, 1..1 (a repetition range meaning exactly one occurrence) is added to the definition of non_negative_integer. This prevents the node from being merged with its child, number.
    rule non_negative_integer
      number 1..1 {
        def hex
          '%x' % number.intval
        end
      }
    end
Let's run the script. Wait, we got the same error again!!
    $ ruby integer.rb integer_as_hex2.treetop
    nodes:
    SyntaxNode+Top0+NegativeInteger1+NegativeInteger0 offset=0, "-255" (value,intval,hex,number):
      SyntaxNode offset=0, "-"
      SyntaxNode+Number0 offset=1, "255" (intval):
        SyntaxNode offset=1, "2"
        SyntaxNode offset=2, "5"
        SyntaxNode offset=3, "5"
    value: "..f01"
    nodes:
    SyntaxNode+Top0+NonNegativeInteger0 offset=0, "255" (value,hex):
      SyntaxNode+Number0 offset=0, "255" (intval):
        SyntaxNode offset=0, "2"
        SyntaxNode offset=1, "5"
        SyntaxNode offset=2, "5"
    value: (eval):120:in `hex': undefined local variable or method `number' for # (NameError)
      from (eval):10:in `value'
      from integer.rb:18:in `'
Another Gotcha
Even though we got the same error, if you look at the node tree carefully, you'll notice that this time the merge stopped at non_negative_integer. number is no longer merged into the node and remains as its child. However, also note that, unlike the negative_integer node, non_negative_integer has no number method. This is why the error happened.
    SyntaxNode+Top0+NegativeInteger1+NegativeInteger0 offset=0, "-255" (value,intval,hex,number):
    SyntaxNode+Top0+NonNegativeInteger0 offset=0, "255" (value,hex):
AFAIK, there is no way to give the method to the node, but you can still access the child through elements as follows:
    rule non_negative_integer
      number 1..1 {
        def hexval
          '%x' % elements.first.intval
        end
      }
    end
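To make the shape of the unmerged tree concrete, here is a plain-Ruby simulation of a parent node that holds number as its only element. This is only a sketch of the structure seen in the trace, not Treetop's implementation; TreeNode, NumberIntval, NonNegativeIntegerHex, and unmerged_node are hypothetical names.

```ruby
# Plain-Ruby simulation; Treetop is not required.
# Every name here is a hypothetical stand-in, not Treetop's real code.

# Stand-in for a syntax node: matched text plus child nodes.
TreeNode = Struct.new(:text_value, :elements)

# Stand-in for the inline block of rule `number`.
module NumberIntval
  def intval
    text_value.to_i
  end
end

# Stand-in for the inline block of `non_negative_integer` with 1..1.
module NonNegativeIntegerHex
  def hexval
    # No `number` accessor exists, so go through the child list instead.
    '%x' % elements.first.intval
  end
end

# Build a parent node whose single child is the number node.
def unmerged_node(text)
  number = TreeNode.new(text, []).extend(NumberIntval)
  TreeNode.new(text, [number]).extend(NonNegativeIntegerHex)
end

node = unmerged_node("255")
puts node.respond_to?(:number) # false
puts node.hexval               # ff
```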
The modified parser below (finally) works.
integer_as_hex.treetop
    grammar IntegerValue do
      rule top
        integer {
          def value
            hexval
          end
        }
      end

      rule integer
        negative_integer / non_negative_integer
      end

      rule negative_integer
        '-' number {
          def intval
            -number.intval
          end

          def hexval
            '%x' % -number.intval
          end
        }
      end

      rule non_negative_integer
        number 1..1 {
          def hexval
            '%x' % elements.first.intval
          end
        }
      end

      rule number
        [0-9]+ {
          def intval
            text_value.to_i
          end
        }
      end
    end
Execution result
    $ ruby integer.rb integer_as_hex.treetop
    nodes:
    SyntaxNode+Top0+NegativeInteger1+NegativeInteger0 offset=0, "-255" (value,intval,hexval,number):
      SyntaxNode offset=0, "-"
      SyntaxNode+Number0 offset=1, "255" (intval):
        SyntaxNode offset=1, "2"
        SyntaxNode offset=2, "5"
        SyntaxNode offset=3, "5"
    value: "..f01"
    nodes:
    SyntaxNode+Top0+NonNegativeInteger0 offset=0, "255" (value,hexval):
      SyntaxNode+Number0 offset=0, "255" (intval):
        SyntaxNode offset=0, "2"
        SyntaxNode offset=1, "5"
        SyntaxNode offset=2, "5"
    value: "ff"
    $
You Fix One, You Break Another
Now we have a parser that generates a hex string. Let's make sure the parser can still produce an integer value. To do so, change the top rule to return intval.
    rule top
      integer {
        def value
          intval
        end
      }
    end
Execution result. Error again :(
    $ ruby integer.rb integer1.treetop
    nodes:
    SyntaxNode+Top0+NegativeInteger1+NegativeInteger0 offset=0, "-255" (value,intval,hexval,number):
      SyntaxNode offset=0, "-"
      SyntaxNode+Number0 offset=1, "255" (intval):
        SyntaxNode offset=1, "2"
        SyntaxNode offset=2, "5"
        SyntaxNode offset=3, "5"
    value: -255
    nodes:
    SyntaxNode+Top0+NonNegativeInteger0 offset=0, "255" (value,hexval):
      SyntaxNode+Number0 offset=0, "255" (intval):
        SyntaxNode offset=0, "2"
        SyntaxNode offset=1, "5"
        SyntaxNode offset=2, "5"
    value: (eval):10:in `value': undefined local variable or method `intval' for # (NameError)
      from integer.rb:18:in `'
This time, it says that intval, called from the value method, is missing on the non_negative_integer node.
And Fix Again for One Last Time
    rule non_negative_integer
      number 1..1 {
        def hexval
          '%x' % elements.first.intval
        end
      }
    end
Since we stopped the node merge of non_negative_integer with its child number by adding 1..1, the node no longer picks up the intval method defined in the child. You'll have to define a custom method this time.
    rule non_negative_integer
      number 1..1 {
        def intval
          elements.first.intval
        end

        def hexval
          '%x' % elements.first.intval
        end
      }
    end
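The reason the explicit intval is needed can be sketched in plain Ruby: once parent and child are separate objects, the child's methods do not show up on the parent until the parent delegates to them. As before, this is a simulation under my own assumptions, and WrapperNode, ChildIntval, NonNegativeIntegerDelegation, and wrap_number are hypothetical names.

```ruby
# Plain-Ruby simulation; Treetop is not required.
# Every name here is a hypothetical stand-in, not Treetop's real code.

WrapperNode = Struct.new(:text_value, :elements)

# Stand-in for the inline block of rule `number`.
module ChildIntval
  def intval
    text_value.to_i
  end
end

# Stand-in for the final inline block of `non_negative_integer`.
module NonNegativeIntegerDelegation
  def intval
    elements.first.intval # delegate to the only child
  end

  def hexval
    '%x' % elements.first.intval
  end
end

# Parent node wrapping the number node, with no custom methods yet.
def wrap_number(text)
  number = WrapperNode.new(text, []).extend(ChildIntval)
  WrapperNode.new(text, [number])
end

parent = wrap_number("255")
puts parent.respond_to?(:intval) # false: nothing is inherited from the child

parent.extend(NonNegativeIntegerDelegation)
puts parent.intval # 255
puts parent.hexval # ff
```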
Remember to use elements.first, as the number method is not available. The following parser (finally, for real) works whichever of the two methods the top rule returns.
integer.treetop
    grammar IntegerValue do
      rule top
        integer {
          def value
            intval
          end
        }
      end

      rule integer
        negative_integer / non_negative_integer
      end

      rule negative_integer
        '-' number {
          def intval
            -number.intval
          end

          def hexval
            '%x' % -number.intval
          end
        }
      end

      rule non_negative_integer
        number 1..1 {
          def intval
            elements.first.intval
          end

          def hexval
            '%x' % elements.first.intval
          end
        }
      end

      rule number
        [0-9]+ {
          def intval
            text_value.to_i
          end
        }
      end
    end
Execution result.
    $ ruby integer.rb integer.treetop
    nodes:
    SyntaxNode+Top0+NegativeInteger1+NegativeInteger0 offset=0, "-255" (value,intval,hexval,number):
      SyntaxNode offset=0, "-"
      SyntaxNode+Number0 offset=1, "255" (intval):
        SyntaxNode offset=1, "2"
        SyntaxNode offset=2, "5"
        SyntaxNode offset=3, "5"
    value: -255
    nodes:
    SyntaxNode+Top0+NonNegativeInteger0 offset=0, "255" (value,intval,hexval):
      SyntaxNode+Number0 offset=0, "255" (intval):
        SyntaxNode offset=0, "2"
        SyntaxNode offset=1, "5"
        SyntaxNode offset=2, "5"
    value: 255
    $
Lessons Learned
- Always be aware of node merge.
- Use 1..1 when you want to prevent node merge.
- For a single-child node where you used 1..1, use elements.first to access the child.
I hope this saves you some time struggling with Treetop. Enjoy parsing!