Running the tests:

The simplest thing to do is run "make test". This tests the lexer with a list of known interesting ruby expressions. It will take several minutes to run. Currently, there are 8-11 (minor) failures, depending on the ruby version. The fact that there are a few failures is more a testament to the thoroughness of the test suite than an indictment of the lexer. Both lexer and test suite are very thorough, but a few more (obscure and unlikely) expressions are supported by the latter than the former. Most of the tests in the suite use rubylexervsruby, described below.

If you're ambitious, try this command: "ruby -Ilib test/code/locatetest.rb". This will use locate to find as much ruby code on your system as it can and test each specimen to see if it can be tokenized correctly (by feeding it to test/code/rubylexervsruby.rb, the operation of which is outlined below under 'testing strategy').

Interpreting output of rubylexervsruby (and locatetest and 'make test'):

The following types of output should be ignored:
  diff file or chunk headers
  lines that look like this:
    executing: ruby testcode/tokentest.rb ...    #normal, 1 for every file
  or this:
    Created warning(s) in new file, line 85: useless use of <=> in void context
  or this:
    Removed warning(s) from old file (?!), line 85: useless use of <=> in void context
The last two kinds of line indicate that a warning was added or deleted. Ultimately, these should go away, but right now it's a low-priority issue.

If you ever see a ruby stack dump in rubylexervsruby output, that's certainly a test failure. Something that looks like a unidiff chunk body (not header) may indicate a test failure as well. To understand more about how the unidiff output is created, see the section on testing strategy below.

locatetest produces lots of (mostly harmless) output. Sorry. htree/template.rb should be ok now.

Currently, lots of warnings are printed about token offsets being off (like: "failed to check offset in N cases..."). This is a problem, but for now I'm ignoring it.
(Most lexer applications don't need token offsets to be correct, and the problem occurs only in a minority of cases, near here documents.)

Diff chunks like this indicate a minor problem with the placement of (empty) string fragments. Ignore it for now:
  @@ -13,2 +13,3 @@
   Shifting token tSTRING_BEG ()
  +Shifting token tSTRING_CONTENT ()
   Shifting token tSTRING_DBEG ()
  @@ -19,2 +20,2 @@
   Shifting token '\n' ()

Diff chunks like this indicate a minor problem with the placement of newlines. Ignore it for now:
  @@ -8,3 +8,2 @@
   Shifting token tSTRING_END ()
  -Shifting token '\n' ()
   Shifting token "end-of-input" ()

There are a few other problems in the test suite as well. Current test status is less clean than I'd like, tho the conformance level of rubylexer is still very high.

if you find any output that doesn't look like one of the above exceptions (for cases that aren't in the existing snippet set), and the input file was valid ruby, please send it to me so that i can add it to my arsenal of tests.

there are a number of 'ruby' files that i know of out there that actually contain syntax errors:
  rpcd.rb from freeride -- missing an end
  sample1.rb from 1.6 version of tcltk -- not legal in ruby 1.8
  bdb.rb from libdb2, 3, and 4 -- not how you declare [] method in ruby

only the first 10 lines of each failing file's output are printed. the rest, as well as other intermediate files, are kept in the testresults directory. the test output files are named *.prs.diff. beware: this directory is never cleaned, and can get quite large. after a large test run, you'll want to empty this directory to recover some disk space.
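
As a rough illustration of the rules above, the ignorable lines could be filtered out of captured test output before reading it by eye. This is only a sketch and not part of rubylexer: the names IGNORABLE and interesting_lines are made up here, and the patterns assume that diff headers are in the usual unified format ("--- ", "+++ ", "@@ "). It does not try to recognize the known-minor chunk bodies (the tSTRING_CONTENT and '\n' cases), so those will still show up.

```ruby
# Hypothetical helper, not part of rubylexer: drop the kinds of output
# lines that the notes above say can be ignored.
IGNORABLE = [
  /\A(---|\+\+\+) /,                       # diff file headers (assumed unified format)
  /\A@@ /,                                 # diff chunk headers
  /\Aexecuting: /,                         # normal, one for every file tested
  /\ACreated warning\(s\) in new file/,    # a warning was added
  /\ARemoved warning\(s\) from old file/   # a warning was deleted
].freeze

# Returns the lines of +output+ that do not match any ignorable pattern.
# Blank lines are dropped as well.
def interesting_lines(output)
  output.each_line.map(&:chomp).reject do |line|
    line.empty? || IGNORABLE.any? { |pattern| pattern.match?(line) }
  end
end
```

Anything this returns is either a chunk body or something unexpected (such as a stack dump), and still needs to be checked against the descriptions above.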
about the directories:
tbd

about testcode/dumptokens.rb:
tbd

about testcode/tokentest.rb:
a fairly simple-minded test utility; given an input file, it uses RubyLexer to tokenize it, then prints out each token as it is found. certain small changes will be made; numeric constants (including char constants) are converted to decimal and strings are converted to double-quoted form, where possible. optional flags can cause other changes:
  --maxws inserts whitespace everywhere that it's possible.
  --implicit inserts parentheses where they were left out at call sites.
  --implicit-all adds parentheses around the lists following when, for, and rescue keywords.
  --keepws is the usual mode; otherwise a 'symbolic mode' is used wherein newline is represented by '#;', for instance.
note: currently the output will not be valid ruby unless only --maxws or --keepws is used. in a future release --implicit will also produce valid ruby, but currently it also puts '*[' and ']' around assignment right hand sides, which only works most of the time.

about testcode/torment:
finds ruby files by other heuristics (not using locate) and runs each through rubylexervsruby. this is roughly comparable to locatetest, but more complicated and (probably) less comprehensive.

about ./test:
this contains a number of ruby files which have failed on my Debian system in the past. as the paths are hard-coded, it's unlikely to be very portable.

testing strategy:
this command:
  ruby -w -y < $1 2>&1 | grep ^Shift|cut -d" " -f3
gives a list of the types of token, as known to ruby, in a source file $1. the utility program tokentest.rb runs the lexer against a source file and then simply prints the tokens out again (perhaps with whitespace inserted between tokens). if the list of token types in this derived source file, as determined by the above command, is the same as in the original, we can be pretty confident that ruby and rubylexer are tokenizing in the same way.
since whitespace is optionally inserted between tokens, it is unlikely that rubylexer is ever finding two tokens where ruby thinks there's only one. it is possible, however, that rubylexer is emitting as a single token things that ruby thinks should be 2 tokens. in fact, this is the case with strings: ruby divides a string into string open, string body, and string close tokens, with optional interpolations, whereas rubylexer has just a single string token (with subtokens, if interpolations are present).
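
the comparison described under testing strategy can be sketched in ruby roughly as follows. this is an illustration only, not the actual code of rubylexervsruby.rb: token_types and first_mismatch are hypothetical names, and the input is assumed to be the text already captured from "ruby -w -y" for the original file and for the file printed by tokentest.rb.

```ruby
# Pull the token-type column out of `ruby -w -y` debug output, in the same
# spirit as `grep ^Shift | cut -d" " -f3` in the command above.
def token_types(yydebug_output)
  yydebug_output.each_line
                .select { |line| line.start_with?("Shift") }
                .map { |line| line.chomp.split(" ")[2] }
end

# Hypothetical helper: index of the first position where the two token-type
# lists differ, or nil if the lists are identical.
def first_mismatch(original_output, derived_output)
  original = token_types(original_output)
  derived  = token_types(derived_output)
  longest  = [original.size, derived.size].max
  (0...longest).find { |i| original[i] != derived[i] }
end
```

a nil result from first_mismatch corresponds to an empty diff; a non-nil result points at the first token on which the original and derived files disagree.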