-=RubyLexer 0.6.2=-

RubyLexer is a lexer library for Ruby, written in Ruby. My goal with RubyLexer
was to create a lexer for Ruby that's complete and correct; all legal Ruby 
code should be lexed correctly by RubyLexer. Just enough parsing 
capability is included to give RubyLexer enough context to tokenize correctly
in all cases. (This turned out to be more parsing than I had thought or 
wanted to take on at first.)

Other Ruby lexers exist, but most are inadequate. For instance, irb has its 
own little lexer, as does (I believe) RDoc, and so do all the IDEs that can 
colorize. I've seen several stand-alone libraries as well. All or almost all 
suffer from the same problem: they skip the hard parts of lexing. RubyLexer 
handles the hard things like complicated strings, the ambiguous nature of 
some punctuation characters and keywords in Ruby, and distinguishing methods 
and local variables.

RubyLexer is not particularly clean code. As I progressed in writing this, 
I learned a little about how these things are supposed to be done: the 
lexer is not supposed to have any state of its own; instead, it gets whatever 
it needs to know from the parser. As a stand-alone lexer, RubyLexer maintains 
quite a lot of state. Every instance variable in the RubyLexer class is some 
sort of lexer state. Most of the complication and ugly code in RubyLexer is 
in maintaining or using this state.
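
To make "lexer state" concrete, here is a toy sketch. It is not RubyLexer's
actual code: the ToyLexer class and its @last_was_value variable are invented
for illustration, and it treats every identifier as a value, which real Ruby
does not allow you to assume. It shows how a single piece of remembered state
decides whether / is division or the start of a regex.

```ruby
require 'strscan'

# Hypothetical toy lexer; not RubyLexer's real implementation.
class ToyLexer
  def initialize(source)
    @scanner = StringScanner.new(source)
    # Lexer state: true when the previous token could end an expression,
    # in which case a following "/" must be division.
    @last_was_value = false
  end

  def tokens
    result = []
    until @scanner.eos?
      if @scanner.skip(/\s+/)
        next                                   # whitespace leaves state alone
      elsif (text = @scanner.scan(/\d+|[a-z_]\w*/))
        result << [:value, text]
        @last_was_value = true
      elsif @last_was_value && (text = @scanner.scan(/\//))
        result << [:divide, text]
        @last_was_value = false
      elsif (text = @scanner.scan(/\/(?:\\.|[^\/\\])*\//))
        result << [:regex, text]
        @last_was_value = true
      else
        result << [:other, @scanner.getch]     # any other single character
        @last_was_value = false
      end
    end
    result
  end
end

ToyLexer.new("x = 6 / 2").tokens   # the "/" comes out as [:divide, "/"]
ToyLexer.new("x = /2/").tokens     # the "/2/" comes out as [:regex, "/2/"]
```

Even this tiny example needs one state variable; handling strings, here
documents, and local variables multiplies that quickly.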

For information about using RubyLexer in your program, please see howtouse.txt.

For my notes on the testing of RubyLexer, see testing.txt.

If you have any questions, comments, problems, new feature requests, or just
want to figure out how to make it work for what you need to do, contact me: 
       rubylexer _at_ inforadical.net

RubyLexer is a RubyForge project. RubyForge is another good place to send your
bug reports or whatever:  http://rubyforge.org/projects/rubylexer/

(There aren't any bugs filed against RubyLexer there yet, but don't be afraid 
that your report will get lonely.)

Status:
RubyLexer can correctly lex all legal Ruby 1.8 code that I've been able to 
find on my Debian system. It can also handle (most of) my catalog of nasty 
test cases (in testdata/p.rb). At this point, new bugs are almost exclusively 
found by my home-grown test code, rather than Ruby code gathered 'from the 
wild'. A largish sample of Ruby recently tested for the first time (that is, 
Rubyx) had _0_ lex errors. (And this is not the only example.) There are a 
number of issues I know about and plan to fix, but it seems that Ruby coders 
don't write code complex enough to trigger them very often. Although 
incomplete, RubyLexer is nevertheless better than many existing ad-hoc 
lexers. For instance, RubyLexer can correctly distinguish all cases of the 
different uses of the following operators, depending on context:
  %   can be modulus operator or start of fancy string
  /   can be division operator or start of regex
  * & + -  can be unary or binary operator
  []  can be for array literal or index method
  <<  can be here document or left shift operator (or in class<<obj expr)
  ::  can be unary or binary operator
  :   can be start of symbol, substitute for then, or part of ternary op
      (there are other uses too, but they're not supported yet.)
  ?   can be start of character constant or ternary operator
  `   can be method name or start of exec string
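
As an illustration of the list above, the plain Ruby below uses several of
these characters in both of their roles. It does not use RubyLexer at all;
the method name operator_roles_demo is invented for this example, and the
"then" use of : and the exec string are left out.

```ruby
# Illustration only: each entry shows one punctuation character in one of
# its roles. The method name is invented for this example.
def operator_roles_demo
  a = 7
  b = 2
  list = [a, b]              # [] as an array literal
  heredoc = <<EOS            # << starting a here document
a << b
EOS
  {
    :modulus   => a % b,            # % as the modulus operator
    :fancy_str => %q(a % b),        # % starting a fancy string
    :quotient  => a / b,            # / as the division operator
    :regex     => /b/,              # / starting a regex
    :negated   => -a,               # - as a unary operator
    :diff      => a - b,            # - as a binary operator
    :indexed   => list[0],          # [] as the index method
    :shifted   => a << b,           # << as the left shift operator
    :heredoc   => heredoc,
    :top_level => ::String,         # :: as a unary operator
    :scoped    => File::SEPARATOR,  # :: as a binary operator
    :ternary   => a > b ? a : b,    # ? and : as the ternary operator
    :char      => ?a                # ? starting a character constant
  }
end
```

The character constant ?a is an Integer in Ruby 1.8 and a one-character
String in Ruby 1.9 and later, so its value depends on the interpreter.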

todo:
test w/ more code (rubygems, rpa, obfuscated ruby contest, rubicon, others?)
these 5 should be my standard test suite: p.rb, (matz') test.rb, tk.rb, obfuscated ruby contest, rubicon
test more ways: cvt source to dos or mac fmt before testing
test more ways: run unit tests after passing thru rubylexer (0.7)
test more ways: test require'd, load'd, or eval'd code as well (0.7)
lex code a line (or chunk) at a time and save state for next line (irb wants this) (0.8)
incremental lexing (ides want this (for performance))
put everything in a namespace
integrate w/ other tools...
html colorized output?
move more state onto @bracestack (ongoing)
expand on test documentation above
the new cases in p.rb now compile, but won't run
use want_op_name more
return result as a half-parsed tree (with parentheses and the like matched)
emit advisory tokens when see beginword, then (or equivalent), or end... what else does florian want?
strings are still slow
rantfile
emit advisory tokens when local var defined/goes out of scope (or hidden/unhidden?)
fakefile should be a mixin
token pruning in dumptokens...

new ruby features not yet supported:
procs without proc keyword, looks like hash to current lexer
keyword arguments, in hash immediates or actual param lists (&formal param lists?)
unicode (0.9)
:wrap and friends... (i wish someone would make a list of all the uses of colon in ruby.)
parens in block param list (works, but hacky)


known issues: (and planned fix release)
context not really preserved when entering or leaving string inclusions. this causes
a number of problems. (0.8)
string tokenization sometimes a little different from ruby around newlines
  (htree/template.rb) (0.8)
string contents might not be correctly translated in a few cases (0.8?)
the implicit tokens might be emitted at the wrong times. (or not at the right times) (need more test code) (0.7)
local variables should be temporarily hidden by class, module, and def (0.7)
windows or mac newline in source are likely to cause problems in obscure cases (need test case)
line numbers are sometimes off... probably to do with multi-line strings (=begin...=end causes this) (0.8)
symbols which contain string interpolations are flattened into one token. eg :"foo#{bar}" (0.8)
methnames and varnames might get mixed up in def header (in idents after the 'def' but before param list) (0.7)
FileAndLineToken not emitted everywhere it should be (0.8)
'\r' whitespace sometimes seen in dos-formatted output.. shouldn't be (eg pre.rb) (0.7)
no way to get offset of __END__ (??) (0.7)
put things in lib/
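
The issue above about local variables being hidden by class, module, and def
refers to a Ruby scoping rule that the lexer has to track. The plain Ruby
sketch below shows the rule itself, not RubyLexer's handling of it; all the
names in it are invented for illustration.

```ruby
counter = 1                          # a local variable at the top level

def seen_inside_def
  defined?(counter)                  # nil: def starts a fresh local scope
end

class ScopeDemo
  SEEN_IN_CLASS = defined?(counter)  # nil: class bodies hide outer locals too
end

module ScopeDemoModule
  SEEN_IN_MODULE = defined?(counter) # nil
end

# Blocks, by contrast, can still see the enclosing local variables.
SEEN_IN_BLOCK = [0].map { defined?(counter) }.first  # "local-variable"
```

Inside def, class, and module, the name counter would be read as a method
call rather than a variable, which is why the distinction matters when
tokenizing.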


fixed issues in 0.6.2:
testcode/dumptokens.rb charhandler.rb doesn't work... but does after unix2dos (not reproducible)
files should be opened in binmode to avoid all possible eol translation
(x.+?x) doesn't work
methname/varname mixups in some cases
performance, in most important cases.
error handling tokens should be emitted on error input... ErrorToken mixin module
but old error handling interface should be preserved and made available.
move readahead and friends into IOext. make optimized readahead et al for fakefile.
dos newlines (and newlines generally) can't be fancy string delimiters
do,if,until, etc, have no way to tell if an end is associated
break readme into pieces


fixed issues in 0.6.0:
the implicit tokens might be emitted at the wrong times. (or not at the right times) (partly fixed) (0.6)
: operator might be a synonym for 'then' (0.6)
variables other than the last are not recognized in multiple assignment. (0.7)
variables created by for and rescue are not recognized. (0.7)
token following :: should not be BareSymbolToken if begins with A-Z (unless obviously a func, eg b/c followed by func param list)
read code to be lexed from a string. (irb wants this) (0.7)
fancy symbols don't work at all. (like this:  %s{abcdefg}) (0.7) [this is regressing now]
Newline,EscNl,BareSymbolToken may get renamed