README.md in ripper-plus-1.1.0.pre2 vs README.md in ripper-plus-1.2.0

- old
+ new

@@ -1,113 +1,207 @@ # ripper-plus -Ripper is the Ruby parser library packaged with Ruby 1.9. While quite complete, it has some quirks that can make use frustrating. This gem intends to correct them. The most important is bareword resolution in Ruby ASTs. +Ripper is the Ruby parser library packaged with Ruby 1.9. While quite complete, it still has bugs (2 of which I've patched alone while working on Laser, with the fixes targeted for 1.9.3: [mlhs-splats](http://redmine.ruby-lang.org/issues/4364) and [words/qwords](http://redmine.ruby-lang.org/issues/4365)), and it has higher-level quirks that can make use frustrating. ripper-plus is a gem intended to demonstrate what I believe the correct output from Ripper should be, *with the goal that these changes become the standard Ripper behavior*. I do *not* want to invent nor maintain a new AST standard. I do however believe these changes warrant separate implementation and discussion before a potential integration with Ripper, and creating a small library to demonstrate the proposed output seemed the simplest approach. Plus, [Laser](https://github.com/michaeledgar/laser/) needs all these improvements, so I had to do it anyway. -## Bareword Resolution +NB: Ripper is a SAX-style parser; one can construct an AST from the parser events however one wishes. Ripper also has a convenience method `Ripper.sexp`, which generates an Array-based AST directly from the SAX events. I personally use `Ripper.sexp` in my work, so the examples will be of `Ripper.sexp` output. All of the discussion below reflects deficiencies in the underlying SAX parser: they are unavoidable whether one uses `Ripper.sexp` or not. +## tl;dr Version + +If you don't want to read everything, here's what you should take away: + +1. Ripper does not distinguish between reading a local variable and implicit-self, no-paren, no-argument method calls. Due to Ruby's semantics, this must be done by the parser. Example: `y = foo`. Is foo a method call? Ripper does not tell you. +2. Ruby's Bison grammar rejects some invalid syntax not by restricting the grammar, but by parsing with a more permissive grammar and then validating the results. Ripper does report some of these errors in its parse tree, but not all of them, and sometimes inconveniently deep in the parse tree. Example: `proc { |a, a| }` is a syntax error, but Ripper parses it without flinching. + +## Still Here? + +If you use Ripper for anything important, I'll assume you've gotten this far. Quick info about `ripper-plus`: it is a gem I hope will correct the following issues until they are addressed in Ripper proper. To use `ripper-plus`, install it with `gem install ripper-plus`. The gem has two entry points: + + require 'ripper-plus' + # Returns an AST from Ripper, modified as described below + RipperPlus.sexp(any_input_object_Ripper_accepts) + # Transforms a Ripper AST into a RipperPlus AST, as described below + RipperPlus.for_ripper_ast(some_ripper_ast) + # Transforms a Ripper AST into a RipperPlus AST, in-place + RipperPlus.for_ripper_ast(some_ripper_ast, :in_place => true) + +That's it - now read about why exactly ripper-plus exists. + +# Bareword Resolution + In Ruby, a local variable is created when it is assigned to. If Ruby finds a bareword `foo` before an assignment `foo = 5`, then Ruby will call `self.foo()`. Kind of. What happens when you run this code? + def label 'hello' end def print_label label = 'Label: ' + label puts label end print_label + If one takes the intuitive approach, one would assume it prints "Label: hello", right? Afraid not: it errors out. The intuitive approach says that `'Label: ' + label` is run before the assignment to the local variable `label`, so the `label` method will be called and appended to `"Label: "`. In reality, it won't, because local variables aren't created on assignment. Local variables are created *immediately when the Ruby parser parses a left-hand-side local variable*. Before the right-hand-side is even parsed, the local variable has been created. This also results in an infamous anomaly: +[code] def say_what? x = 'hello' unless defined?(x) puts x end - say_what? +[/code] `say_what?` prints a blank line! This is because as soon as the `x` on the LHS is parsed, `x` is a local variable with value `nil`. By the time `defined?(x)` is executed, `x` has long since been a local variable! -## Ripper's Bareword Behavior +This is further complicated by the many ways a local variable can be introduced: -Ripper doesn't carry scope information as it parses, and as such, parses any lone identifier as an AST node of type `:var_ref`. It is up to consumers of the AST to figure out the meaning of the `:var_ref` node. It can be reasonably argued that the semantics of the `:var_ref` node should not be part of the AST, as my thesis adviser pointed out when I complained about this, as the two are syntactically identical. Unfortunately, the meaning of the `var_ref` node comes from the parser itself; any attempt to determine semantics based solely on the AST is simply re-creating the work the parser ordinarily does. Indeed, when Ruby parses the code for execution, it *does* create different internal AST node types upon seeing a method-call bareword and a local variable bareword! +1. Single- or multiple-assignment (including sub-assignment) anywhere an expression is allowed +2. A `for` loop's iterator variable(s) +3. Formal arguments to a method +4. Block arguments and block-local variables +5. Stabby Lambda variables +6. Rescue exception naming (`rescue StandardError => err`) +7. Named Regexp captures in literal matches (`/name: (?<person>\w+)/ =~ 'name: Mike'` creates a local variable named "person" with value "Mike") -I'd like to see this behavior rolled into Ripper proper. Until then, `ripper-plus` does it. +I think that's a complete list, but I may be mistaken. In fact, I had forgotten named captures until I double-checked this list while writing this blog post. I tried to find all paths in `parse.y` to a call to `assignable()`, the C function in Ruby's parser that creates local variable in the current scope, and I think I caught them all, but I may have slipped up. -## ripper-plus +## Ripper's Mistake -By using our knowledge of Ruby's parser, we can recreate the scope-tracking behavior by walking the AST generated by Ripper. We simply must do normal scope tracking, observe the creation of local variables, and ensure that we walk each node of the tree in the order in which those nodes would have been parsed. Most subtrees generated by Ripper are already in this order, with the exception of the modifier nodes (`foo.bar if baz`, `x = bar() while x != nil`). Most importantly, since everything is an expression in Ruby and local variable assignments can occur just about anywhere, exact scoping semantics must be recreated. Corner cases such as this: +Ripper doesn't carry scope information as it parses, and as such, parses any lone identifier as an AST node of type `:var_ref`. It is up to consumers of the AST to figure out the meaning of the `:var_ref` node. It can be reasonably argued that the semantics of the `:var_ref` node should not be part of the AST, as my thesis adviser pointed out when I complained about this, as the two are syntactically identical. Unfortunately, the meaning of the `:var_ref` node comes from the parser itself; any attempt to determine semantics based solely on the Ripper AST must simulate the work the parser ordinarily does. Indeed, when Ruby parses the code for execution, it *does* create different internal node types upon seeing a method-call bareword and a local variable bareword! +I'd like to see this behavior rolled into Ripper proper. Until then, `ripper-plus` is a reasonable replacement. + +## ripper-plus's Approach + +By using our knowledge of Ruby's parser, we can simulate the scope-tracking behavior by walking the AST generated by Ripper. We simply do normal scope tracking, observe the creation of local variables, and ensure that we walk each node of the tree in the order in which those nodes would have been parsed. Most subtrees generated by Ripper are already in this order, with the exception of the modifier nodes (`foo.bar if baz`, `x = bar() while x != nil`). Most importantly, since everything<sup>[*](http://carboni.ca/blog/p/Statements-in-Ruby)</sup> is an expression in Ruby and local variable assignments can occur just about anywhere, exact scoping semantics must be recreated. Every possible introduction of local variables (exception naming, named captures, ...) must be considered as the Ruby parser would. Corner cases such as this: + +[code] def foo(x, y = y) end +[/code] Need to be properly resolved. Did you know that, unlike the `label = label` example above, the above code is equivalent to: +[code] def foo(x, y = y()) end +[/code] -Anyway, ripper-plus turns all method-call `:var_ref` nodes into `:zcall` nodes; the node structure is otherwise unchanged. It runs in O(N) time and O(h) space, where h is the height of the tree. +Why doesn't y end up `nil` in the default case, as it would if you typed `y = y` into a method definition? -## Syntax Errors +Anyway, ripper-plus turns all method-call `:var_ref` nodes into `:zcall` nodes; the node structure is otherwise unchanged. I believe Ripper should already make this distinction, and it can relatively simply: it simply has to re-implement the existing scope-tracking behavior to the relevant Ripper action routines. Not trivial, but `ripper-plus` does it in a couple hundred lines of Ruby. -Not all syntax errors in Ruby fail to parse as valid Ruby. Using `next`/`redo`/`break`/`retry` when they have no meaning will cause a "compile error (SyntaxError)" exception to be raised during parsing. Attempting to use those expressions in a value context will also fail to compile. Ripper, however, doesn't do any of this validation: +# Error Handling - pp Ripper.sexp('x = 5; next x') - #=> - [:program, - [[:assign, [:var_field, [:@ident, "x", [1, 0]]], [:@int, "5", [1, 4]]], - [:next, [:args_add_block, [[:var_ref, [:@ident, "x", [1, 12]]]], false]]]] +As mentioned above, not all invalid syntaxes are excluded by Ruby's CFG production rules: additional validation happens occasionally, and when validation fails, `yyerror` is called, and parsing attempts recovery. The Ruby program will not run if such an error exists, but since parsing continues, additional such errors might be found. These are some of (but likely not all) of the errors I'm describing: -Assigning to set read-only variables is also caught by the Ruby parser, and not at runtime. An easy way to verify this is to put a simple print statement before the offending syntax error: +1. `class name ... end`/`module name ... end` with a non-constant name (`=~ /^[^A-Z]/`) +2. `alias gvar $[0-9]`: you can't alias the numeric capture variables +3. duplicate argument names to a method, block +4. constant/instance variable/class variable/global variable as an argument name +5. assigning a constant in a method definition +6. class/module definition syntax in a method body +7. defining singleton methods on literals using expression syntax (`def (expr).foo; end`) - $ ruby -e 'p 5; $1 = "5"' - -e:1: Can't set variable $1 +Whatever your tool is that is dealing with Ruby ASTs, it likely is concerned with whether the program is valid or not, and any program containing the above errors is invalid: Ruby won't even attempt to run it. Ripper, one would suppose, would also inform consumers of the parse tree that a given program is invalid in these ways. -Ripper catches this by wrapping the offending assignment in an `:assign_error` node. But not all such invalid assignments are caught: +## Ripper's Mistake - pp Ripper.sexp('$1 = 5') - #=> - [:program, - [[:assign, [:assign_error, [:@backref, "$1", [1, 0]]], [:@int, "5", [1, 5]]]]] +Some of these errors are noticed by Ripper, and the offending portion of the syntax tree will be wrapped in a corresponding error node: - pp Ripper.sexp('nil = self') - #=> - [:program, - [[:assign, - [:var_field, [:@kw, "nil", [1, 0]]], - [:var_ref, [:@kw, "self", [1, 6]]]]]] +[code] + # Invalid alias + pp Ripper.sexp('alias $foo $2') + #=>[:program, + [[:alias_error, + [:var_alias, [:@gvar, "$foo", [1, 6]], [:@backref, "$2", [1, 11]]]]]] + # Invalid class name + pp Ripper.sexp('class abc; end') + #=> [:program, + [[:class, + [:const_ref, [:class_name_error, [:@ident, "abc", [1, 6]]]], + nil, + [:bodystmt, [[:void_stmt]], nil, nil, nil]]]] +[/code] -Ripper has at least one other error node type, `:class_name_error`: +The first error, a disallowed alias, results in an `:alias_error` node wrapping the entire `:var_alias` node. This is good: anybody walking this AST would see that the entire contents of the node are semantically invalid *before* any information about the attempted alias. The second error, an invalid class name, results in a `:class_name_error` node deep in the `:class` node structure. This is bad: it puts the onus on every consumer of the AST to check all class/module definitions for `:class_name_error` nodes before assuming *anything* about the meaning of the `:class` node. This unfortunate placement also occurs when a parameter is named with a non-identifier: - pp Ripper.sexp('class abc; end') - #=> - [:program, - [[:class, - [:const_ref, [:class_name_error, [:@ident, "foo", [1, 6]]]], - nil, - [:bodystmt, [[:void_stmt]], nil, nil, nil]]]] +[code] + pp Ripper.sexp('def foo(a, @b); end') + #=> [:program, + [[:def, + [:@ident, "foo", [1, 4]], + [:paren, + [:params, + [[:@ident, "a", [1, 8]], [:param_error, [:@ivar, "@b", [1, 11]]]], + nil, nil, nil, nil]], + [:bodystmt, [[:void_stmt]], nil, nil, nil]]]] +[/code] -Ruby has a few dozen of these syntax errors that are caught at compile time, and most of them are things you would be hard-pressed to justify: re-using argument names (`def foo(x, x); end`), creating classes or modules in method bodies, and so on. Yet Ripper does not provide an easy way to check if these errors are present in a given AST. +Worse, some errors are not caught at all: -I'm not convinced that wrapping offending nodes in different node types is the best way to handle this issue. Firstly, the error nodes are not always high enough in the tree (`:alias_error`, which arises when aliasing `$1, $2, ...`, is a notable exception). The position of the `:class_name_error` node forces every consumption of a `:class` node to check if the name subtree contains a `:class_name_error` node. The whole class is invalid though! If anything, the entire `:class` node should be inside an `:error` node. The same holds true for `:assign` nodes. This isn't really the Ripper designer's faults: Ripper is essentially a separate set of action routines in the same bison grammar used by YARV proper, and as an SAX-style parser, it'd be much harder to push errors upward. +[code] + pp Ripper.sexp('def foo(a, a, a); end') + #=> [:program, + [[:def, + [:@ident, "foo", [1, 4]], + [:paren, + [:params, + [[:@ident, "a", [1, 8]], + [:@ident, "a", [1, 11]], + [:@ident, "a", [1, 14]]], + nil, nil, nil, nil]], + [:bodystmt, [[:void_stmt]], nil, nil, nil]]]] +[/code] -It does seem reasonable, however, that while we are transforming `:var_ref` into `:zcall` nodes, we can find these errors (including ones ignored by Ripper), wrap them in `:error` nodes, and provide a list of exceptions corresponding to the semantic errors. `ripper-plus` does just that. For example, the above error will transform into this: +This is very bad. Now every consumer of the Ripper parse tree must check *all method definitions* for repeated arguments, and bail accordingly. Don't forget about subassignments in block argument declarations; you'll need to check those yourself, too: - [:program, - [[:error, - [:class, - [:const_ref, [:class_name_error, [:@ident, "foo", [1, 6]]]], - nil, - [:bodystmt, [[:void_stmt]], nil, nil, nil]], - 'class/module name must be CONSTANT']]] +[code] + pp Ripper.sexp('something.each { |a, (b, *c, (d, e), a), f| }') + #=> [:program, + [[:method_add_block, + [:call, + [:var_ref, [:@ident, "something", [1, 0]]], :".", + [:@ident, "each", [1, 10]]], + [:brace_block, [:block_var, + [:params, + [[:@ident, "a", [1, 18]], + [:mlhs_paren, + [:mlhs_add_star, + [[:mlhs_paren, [:@ident, "b", [1, 22]]]], + [:@ident, "c", [1, 26]]]], + [:@ident, "f", [1, 41]]], + nil, nil, nil, nil], nil], + [[:void_stmt]]]]]] +[/code] -## You Should be Using ripper-plus +Can your Ripper-consuming code handle that case? -The truth is, everybody who is using Ripper right now *should* be doing *all* of this. Anything short, and you have bugs. [Laser](https://github.com/michaeledgar/laser/) has bugs as a result. It's a pain in the ass to get it all right. `ripper-plus` probably has bugs - I'm not gonna lie, I found one while writing this. But I'm pretty damn sure it's solid. Hopefully, in Ruby 1.9.x, this will be the default. For now, you *should* use ripper-plus. +## ripper-plus's Approach +This is where it gets murky, and I don't have a definitive answer just yet. Right now, `ripper-plus` wraps all offending nodes it finds (which is *not* all such errors - yet) in a generic `:error` node. At least with this solution, anyone walking a `RipperPlus` AST will know when a given node is semantically invalid. However, the code walking the AST must find an `:error` node before it knows the entire AST is meaningless. For a tool like [YARD](http://yardoc.org/) or [Redcar](http://redcareditor.com/), which likely would prefer to keep extracting information from the AST, this seems preferable. The `:error` node could potentially be specified to potentially include descriptions and locations of the syntax errors discovered. + +Yet it also seems convenient to receive, upon attempting to parse an invalid program, a simple list of errors and nothing else. For a tool like Laser, this is far preferable, and wastes a lot less time. I think Laser may be in the minority here. So Laser will do its own thing regardless. + +# Minor Gripes + +1. Flip-flops don't have their own node type. Upon encountering a range, you must check if you are in a conditional context manually. This is annoying. +2. \_\_END\_\_ just stops parsing: there's no getting at the actual data past it from Ripper's output. +3. Ripper.sexp never fails. Given a program that fails to parse, Ripper will simply return the best recovery parse Bison can come up with. (`Ripper.sexp('x*^y#$a') == [:program, [:var_ref, [:@ident, "y", [1, 3]]]]`) +4. This space reserved for future gripes. + +## Conclusion + +Overall, Ripper is very complete for a library covered in "EXPERIMENTAL" warning labels, and gives concise, traditional ASTs. What I've put forward is all I ask after nearly a year of being neck-deep in Ripper output. I think the main two points I've covered need to be addressed in the 1.9 branch of Ruby, and over the coming months, hope to work to get that done. + +In the meantime, I'll be using `ripper-plus` in Laser to statically analyze all kinds of stuff about Ruby, but I look forward to the day where I can add `RipperPlus = Ripper if RUBY_VERSION >= MERGED_VERSION` to ripper-plus's main file. `ripper-plus` is O(N), though it's not blazingly fast: it takes about 20-30 ms on my machine to transform the Ripper AST for [the biggest, ugliest Laser code file: the CFG compiler](https://github.com/michaeledgar/laser/blob/master/lib/laser/analysis/control_flow/cfg_builder.rb) to a RipperPlus AST. It takes 50ms for Ripper to parse it in the first place. (That benchmarked transformation is done in-place: `ripper-plus`'s default behavior is to duplicate the Ripper AST.) + +If you have ideas or thoughts, please discuss them on ruby-talk and/or ruby-core! I don't have comments implemented for a reason: I feel it fragments the center of discussion. + ## Contributing to ripper-plus * Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet * Check out the issue tracker to make sure someone already hasn't requested it and/or contributed it * Fork the project @@ -117,7 +211,6 @@ * Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it. ## Copyright Copyright (c) 2011 Michael Edgar. See LICENSE.txt for -further details. - +further details. \ No newline at end of file