Running the tests:
the simplest thing to do is run "ruby -Ilib test/code/locatetest.rb". 
this will use locate to find as much ruby code as possible on your system and test 
each specimen to see if it can be tokenized correctly (by feeding it to 
testcode/rubylexervsruby.rb, the operation of which is outlined below 
under 'testing strategy').

Interpreting the output of rubylexervsruby.rb (and locatetest):
in rubylexervsruby, i've tried to follow the philosophy that the test program
doesn't print anything unless there's an error. perhaps i haven't followed
this far enough; every run of rubylexervsruby produces a little output, and 
sometimes a run will produce output that doesn't actually indicate a problem, 
or only a low-priority problem. (since locatetest, torment, and test all run
rubylexervsruby over and over, they all produce lots of (mostly harmless) 
output. sorry.)

the following types of output should be ignored:

diff file or chunk headers

lines that look like this:
  executing: ruby testcode/tokentest.rb ...     #normal, 1 for every file
or this:
  warning moved from 24 to 22: ambiguous first argument; put parentheses or even spaces
or this:
  Created warning(s) in new file, line 85: useless use of <=> in void context
or this:
  Removed warning(s) from old file (?!), line 85: useless use of <=> in void context
indicate that a warning was added, deleted, or moved. ultimately, these should
go away, but right now it's a low-priority issue.
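
the ignore list above could be turned into a small filter, roughly like the
sketch below. this is only an illustration and not part of rubylexer: the
names IGNORABLE and ignorable? are made up, and the patterns are guessed from
the sample lines above, so they may need adjusting against real output.

```ruby
# hypothetical helper, not part of rubylexer: decides whether one line of
# rubylexervsruby output falls in the "should be ignored" category.
IGNORABLE = [
  /\A(---|\+\+\+) /,                                  # unidiff file headers
  /\A@@ .* @@/,                                       # unidiff chunk headers
  /\A\s*executing: /,                                 # normal, 1 for every file
  /\A\s*warning moved from \d+ to \d+: /,
  /\A\s*Created warning\(s\) in new file, line \d+: /,
  /\A\s*Removed warning\(s\) from old file( \(\?!\))?, line \d+: /
]

def ignorable?(line)
  IGNORABLE.any? { |re| re =~ line }
end

# possible use (assumes the test output was saved to a file first):
#   File.foreach("locatetest.out") { |l| puts l unless ignorable?(l) }
```

anything this lets through still has to be judged by hand; in particular, a
removed diff line whose own text starts with "-- " would be mistaken for a
file header.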

if you ever see ruby stack dump in rubylexervsruby output, that's certainly
an error (if the input ruby code is valid). 

something that looks like a unidiff chunk body (not header) may indicate 
an error as well. the problem is that sometimes those amorphous warnings 
sneak through my filter (which is supposed to condense them into a single 
line like those above), so you will see diff chunks where the only real
difference is a warning. here are some examples of the kind of diff chunks 
that should NOT cause alarm:

--:89: warning: useless use of <=> in void context
--:92: warning: useless use of <=> in void context
+-:90: warning: useless use of <=> in void context
 Stack now 0 2 62 300 5 110 365 544

 Shifting token tIDENTIFIER, Entering state 34
-Reading a token: -:318: warning: ambiguous first argument; put parentheses or even spaces
-Next token is token tINTEGER ()
+Reading a token: Next token is token tINTEGER ()
 Reducing stack by rule 476 (line 2382), tIDENTIFIER -> operation

if you look closely (and are experienced in reading unidiff output), you'll
see that the only difference is a warning. to understand more about how the
unidiff output is created, see the section on testing strategy below.
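
for those not fluent in unidiff, the check can be done mechanically: rebuild
the old and new sides of a chunk, delete anything that looks like a warning,
and see whether the two sides then match. the sketch below is an illustration
only, not rubylexer's actual filter; the name only_warnings_differ? is made
up, and the WARNING pattern assumes warnings always look like
"-:LINE: warning: TEXT" followed by a newline, as in the examples above.

```ruby
# hypothetical helper, not part of rubylexer.
# assumed warning shape: "-:LINE: warning: TEXT\n", possibly in mid-line.
WARNING = /-:\d+: warning: [^\n]*\n/

# chunk_body: the text of one unidiff chunk, without its @@ header line.
def only_warnings_differ?(chunk_body)
  old_side = String.new
  new_side = String.new
  chunk_body.each_line do |line|
    text = line.sub(/\A[-+ ]/, "")   # drop the diff marker column
    case line[0]
    when "-" then old_side << text   # only in the old file
    when "+" then new_side << text   # only in the new file
    else                             # context: present in both files
      old_side << text
      new_side << text
    end
  end
  old_side.gsub(WARNING, "") == new_side.gsub(WARNING, "")
end
```

both example chunks above come out as harmless (true) under this check; a
chunk where a token type actually changed comes out false.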

htree/template.rb:
testing this file prints a small unidiff chunk. analysis indicates that the
problem is because ruby's lexer generates an extra (empty) string content 
token at this point, which mine omits. there's no actual semantic difference
between the two tokenizations, so there's nothing to be concerned about. in
a future release, when my lexer supports the notion of string contents and
string delimiters as separate token types, i'll try to emulate ruby more 
closely. the same case is replicated in p.rb.
(in other words, ignore the error in this file and the identical one in p.rb.)


if you find any output that doesn't look like one of the above exceptions, 
and the input file was valid ruby, please send it to me so that i can add it
to my arsenal of tests.

there are a number of 'ruby' files that i know of out there that actually 
contain syntax errors:
rpcd.rb from freeride -- missing an end
sample1.rb from 1.6 version of tcltk -- not legal in ruby 1.8
bdb.rb from libdb2, 3, and 4 -- not how you declare [] method in ruby

testdata/p.rb (my menagerie of weird test cases) is one of the worst 
offenders; it prints lots of output when tested, but all of the problems
are harmless or minor.

only the first 10 lines of each failing file are printed. the rest, as well
as other intermediate files, are kept in the testresults directory. the test
output files are named *.prs.diff. beware: this directory is never cleaned,
and can get quite large. after a large test run, you'll want to empty this 
directory to recover some disk space.
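
a rough sketch of checking on and emptying that directory from ruby follows.
it is not part of rubylexer; the helper names are made up, and it assumes the
directory is called testresults and sits under the current directory, so pass
a different path if yours lives elsewhere.

```ruby
require "fileutils"

# hypothetical helpers, not part of rubylexer.

# reports how many files (and how many *.prs.diff files) are in dir,
# and how many bytes they take up in total.
def testresults_usage(dir = "testresults")
  files = Dir.glob(File.join(dir, "**", "*")).select { |f| File.file?(f) }
  diffs = files.select { |f| f.end_with?(".prs.diff") }
  { :files => files.size,
    :diffs => diffs.size,
    :bytes => files.inject(0) { |sum, f| sum + File.size(f) } }
end

# deletes everything inside dir (dotfiles excepted) but keeps dir itself.
def empty_testresults(dir = "testresults")
  FileUtils.rm_rf(Dir.glob(File.join(dir, "*")))
end
```

for example, "p testresults_usage" before a large run and again afterwards
shows how much the run added; empty_testresults is not undoable, so save any
*.prs.diff files you still want to look at first.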

about the directories: tbd

about testcode/dumptokens.rb: tbd

about testcode/tokentest.rb:
a fairly simple-minded test utility; given an input file, it uses RubyLexer
to tokenize it, then prints out each token as it is found. certain small
changes will be made; numeric constants (including char constants) are 
converted to decimal and strings are converted to double-quoted form, where
possible. optional flags can cause other changes: --maxws inserts whitespace
everywhere that it's possible, --implicit inserts parentheses where they 
were left out at call sites. --implicit-all adds parentheses around the lists
following when, for, and rescue keywords. --keepws is the usual mode; 
otherwise a 'symbolic mode' is used wherein newline is represented by '#;',
for instance. note: currently the output will not be valid ruby unless
only --maxws or --keepws is used. in a future release --implicit will
also be valid ruby, but currently it also puts '*[' and ']' around assignment
right hand sides, which only works most of the time.
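
to make "converted to decimal" concrete, here is a small illustration of the
idea. it is not tokentest's actual code: decimal_form is a made-up name, it
only handles integer literals and simple one-character char constants (no
escapes like ?\n, no floats), and it leans on ruby's own Integer() to do the
radix work.

```ruby
# hypothetical illustration, not tokentest's actual code.
def decimal_form(literal)
  if literal =~ /\A\?(.)\z/m
    $1.ord.to_s              # char constant such as ?a: use its character code
  else
    Integer(literal).to_s    # Integer() understands 0x, 0b, 0o, leading-0 octal and _
  end
end

puts decimal_form("0x1F")    # prints 31
puts decimal_form("017")     # prints 15
puts decimal_form("1_000")   # prints 1000
puts decimal_form("?a")      # prints 97
```

the point of normalizing like this is that two spellings of the same number
print identically, so they can't show up as a spurious difference later.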

about testcode/torment:
finds ruby files by other heuristics (not using locate) and runs each
through rubylexervsruby. this is roughly comparable to locatetest, but
more complicated and (probably) less comprehensive.

about ./test:
this contains a number of ruby files which have failed on my Debian system 
in the past. as the paths are hard-coded, it's unlikely to be very portable.

testing strategy:
this command:
ruby -w -y < $1 2>&1 | grep ^Shift|cut -d" " -f3
gives a list of the types of token, as known to ruby, in a source file $1. the
utility program tokentest.rb runs the lexer against a source file and then simply
prints the tokens out again (perhaps with whitespace inserted between tokens). if
the list of token types in this derived source file, as determined by the above command,
is the same as in the original, we can be pretty confident that ruby and rubylexer are
tokenizing in the same way. since whitespace is optionally inserted between tokens, it
is unlikely that rubylexer is ever finding two tokens where ruby thinks there's only one.
it is possible, however, that rubylexer is emitting as a single token things that ruby
thinks should be 2 tokens. and in fact, this is the case with strings: ruby divides a
string into string open, string body, and string close tokens with optional interpolations,
whereas rubylexer has just a single string token (with subtokens, if interpolations are
present.) this difference in handling accounts in part for rubylexer's inability
to correctly lex certain very complicated strings.
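
the comparison step can be sketched in ruby as well. this is an illustration
under assumptions, not what rubylexervsruby actually does: token_types and
first_mismatch are made-up names, and the input is assumed to be the saved
output of "ruby -w -y < file 2>&1" in the format shown in the diff examples
earlier (other ruby versions may print -y output differently).

```ruby
# hypothetical helpers, not part of rubylexer.

# roughly the same job as: grep ^Shift | cut -d" " -f3
# (like cut, this keeps the trailing comma on each token type)
def token_types(yydebug_output)
  yydebug_output.lines.grep(/\AShift/).map { |line| line.split(" ")[2] }
end

# returns [index, type_in_a, type_in_b] for the first place two token type
# lists disagree, or nil if they are identical.
def first_mismatch(a, b)
  count = [a.size, b.size].max
  (0...count).each do |i|
    return [i, a[i], b[i]] if a[i] != b[i]
  end
  nil
end
```

run token_types over the -y output for the original file and for the file
tokentest printed, then hand both lists to first_mismatch; nil means ruby saw
the same sequence of token types in both.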