The TextParser implements a regular LALR parser. But it uses a recursive rule traversor instead of the more commonly found state machine generated by yacc-like tools. Since stack depths is not really an issue with a Ruby implementation this approach has one big advantage. The syntax of the parser can be modified during parsing. This allows support for languages that can extend themself. The TaskJuggler syntax is such an beast. Traditional yacc generated parsers would fail with such a syntax.
This class is just a base class. A complete parser would derive from this class and implement the rule set and the functions _nextToken()_ and _returnToken()_. It also needs to set the array variables to declare all variables ($SOMENAME) that the scanner may deliver.
To describe the syntax the functions TextParser#pattern, TextParser#optional and TextParser#repeatable can be used. When the rule set is changed during parsing, TextParser#updateParserTables must be called to make the changes effective. The parser can also document the syntax automatically. To document a pattern, the functions TextParser#doc, TextParser#descr, TextParser#also and TextParser#arg can be used.
To start parsing the input the function TextParser#parse needs to be called with the name of the start rule.
encoding: UTF-8
Copyright © 2006, 2007, 2008, 2009, 2010 by Chris Schlaeger
This program is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation.
encoding: UTF-8
Copyright © 2006, 2007, 2008, 2009, 2010 by Chris Schlaeger
This program is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation.
encoding: UTF-8
Copyright © 2006, 2007, 2008, 2009, 2010 by Chris Schlaeger
This program is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation.
Create a new TextParser object.
# File lib/TextParser.rb, line 71 71: def initialize(messageHandler) 72: # The message handler will collect all error messages. 73: @messageHandler = messageHandler 74: # This Hash will store the ruleset that the parser is operating on. 75: @rules = { } 76: # Array to hold the token types that the scanner can return. 77: @variables = [] 78: # An list of token types that are not allowed in the current context. 79: @badVariables = [] 80: # The currently processed rule. 81: @cr = nil 82: end
# File lib/TextParser.rb, line 197 197: def error(id, text, sfi = nil, data = nil) 198: sfi ||= sourceFileInfo 199: if @scanner 200: # The scanner has some more context information, so we pass the error 201: # on to the TextScanner. 202: @scanner.error(id, text, sfi, data) 203: else 204: @messageHandler.error(id, text, sfi, data) 205: end 206: end
Call all methods that start with ‘rule_’ to initialize the rules.
# File lib/TextParser.rb, line 94 94: def initRules 95: methods.each do |m| 96: if m[0, 5] == 'rule_' 97: # Create a new rule with the suffix of the function name as name. 98: newRule(m[5..1]) 99: # Call the function. 100: send(m) 101: end 102: end 103: end
Limit the allowed tokens of the scanner to the subset passed by the tokenSet Array.
# File lib/TextParser.rb, line 86 86: def limitTokenSet(tokenSet) 87: return unless tokenSet 88: 89: @badVariables = @variables.dup 90: @badVariables.delete_if { |v| tokenSet.include?(v) } 91: end
# File lib/TextParser.rb, line 188 188: def matchingRules(keyword) 189: matches = [] 190: @rules.each do |name, rule| 191: patIdx = rule.matchingPatternIndex('_' + keyword) 192: matches << [ rule, patIdx ] if patIdx 193: end 194: matches 195: end
Add a new rule to the rule set. name must be a unique identifier. The function also sets the class variable @cr to the new rule. Subsequent calls to TextParser#pattern, TextParser#optional or TextParser#repeatable will then implicitely operate on the most recently added rule.
# File lib/TextParser.rb, line 110 110: def newRule(name) 111: raise "Fatal Error: Rule #{name} already exists" if @rules.has_key?(name) 112: 113: if block_given? 114: saveCr = @cr 115: @rules[name] = @cr = TextParser::Rule.new(name) 116: yield 117: @cr = saveCr 118: else 119: @rules[name] = @cr = TextParser::Rule.new(name) 120: end 121: end
Identify the patterns of the most recently added rule as optional syntax elements.
# File lib/TextParser.rb, line 140 140: def optional 141: @cr.setOptional 142: end
To parse the input this function needs to be called with the name of the rule to start with. It returns the result of the processing function of the top-level parser rule that was specified by ruleName. In case of an error, the result is false.
# File lib/TextParser.rb, line 164 164: def parse(ruleName) 165: @stack = [] 166: @@expectedTokens = [] 167: updateParserTables 168: begin 169: result = parseRuleR(@rules[ruleName]) 170: rescue TjException => msg 171: if msg.message && !msg.message.empty? 172: @messageHandler.critical('parse', msg.message) 173: end 174: return false 175: end 176: 177: result 178: end
Add a new pattern to the most recently added rule. tokens is an array of strings that specify the syntax elements of the pattern. Each token must start with an character that identifies the type of the token. The following types are supported.
! a reference to another rule
$ a variable token as delivered by the scanner
_ a literal token.
func is a Proc object that is called whenever the parser has completed the processing of this rule.
# File lib/TextParser.rb, line 134 134: def pattern(tokens, func = nil) 135: @cr.addPattern(TextParser::Pattern.new(tokens, func)) 136: end
Identify the patterns of the most recently added rule as repeatable syntax elements.
# File lib/TextParser.rb, line 146 146: def repeatable 147: @cr.setRepeatable 148: end
Return the SourceFileInfo of the TextScanner at the beginning of the currently processed TextParser::Rule. Or return nil if we don’t have a current position.
# File lib/TextParser.rb, line 183 183: def sourceFileInfo 184: return nil if @stack.nil? || @stack.empty? 185: @stack.last.sourceFileInfo[0] 186: end
This function needs to be called whenever new rules or patterns have been added and before the next call to TextParser#parse.
# File lib/TextParser.rb, line 152 152: def updateParserTables 153: @rules.each_value { |rule| rule.transitions = {} } 154: @rules.each_value do |rule| 155: getTransitions(rule) 156: checkRule(rule) 157: end 158: end
# File lib/TextParser.rb, line 208 208: def warning(id, text, sfi = nil, data = nil) 209: sfi ||= sourceFileInfo 210: if @scanner 211: # The scanner has some more context information, so we pass the 212: # warning on to the TextScanner. 213: @scanner.warning(id, text, sfi, data) 214: else 215: @messageHandler.warning(id, text, sfi, data) 216: end 217: end
# File lib/TextParser.rb, line 277 277: def checkRule(rule) 278: if rule.patterns.empty? 279: raise "Rule #{rule.name} must have at least one pattern" 280: end 281: 282: rule.patterns.each do |pat| 283: pat.each do |tok| 284: type = tok[0] 285: token = tok[1..1] 286: if type == $$ 287: if @variables.index(token).nil? 288: error('unsupported_token', 289: "The token #{token} is not supported here.", token[2]) 290: end 291: elsif type == !! 292: if @rules[token].nil? 293: raise "Fatal Error: Reference to unknown rule #{token} in " + 294: "pattern '#{pat}' of rule #{rule.name}" 295: end 296: end 297: end 298: end 299: end
# File lib/TextParser.rb, line 537 537: def findPattern(rule, token, repeatMode) 538: # The scanner cannot differentiate between keywords and identifiers. So 539: # whenever an identifier is returned we have to see if we have a 540: # matching keyword first. If none is found, then look for normal 541: # identifiers. 542: if token[0] == 'ID' 543: if (patIdx = rule.matchingPatternIndex('_' + token[1])).nil? 544: patIdx = rule.matchingPatternIndex("$ID") 545: end 546: elsif token[0] == 'LITERAL' 547: patIdx = rule.matchingPatternIndex('_' + token[1]) 548: elsif token[0] == false 549: patIdx = rule.matchingPatternIndex('.') 550: else 551: patIdx = rule.matchingPatternIndex('$' + token[0]) 552: end 553: 554: # If no matching pattern is found for the token we have to check if the 555: # rule is optional or we are in repeat mode. If this is the case, return 556: # the token back to the scanner. Otherwise, we have found a token we 557: # cannot handle at this point. 558: if patIdx.nil? 559: # Append the list of expected tokens to the @@expectedToken array. 560: # This may be used in a later rule to provide more details when an 561: # error occured. 562: rule.transitions.each do |transition| 563: keys = transition.keys 564: keys.collect! { |key| key[1..1] } 565: @@expectedTokens += keys 566: @@expectedTokens.sort! 567: end 568: 569: unless rule.optional?(@rules) || repeatMode 570: error('unexpctd_token', 571: (token[0] != false ? 572: "Unexpected token '#{token[1]}' of type " + 573: "'#{token[0]}'. " : 574: "Unexpected end of file in #{@scanner.fileName}. ") + 575: (@@expectedTokens.length > 1 ? 576: "Expecting one of #{@@expectedTokens.join(', ')}" : 577: "Expecting #{@@expectedTokens[0]}"), token[2]) 578: end 579: returnToken(token) 580: return nil 581: end 582: 583: rule.pattern(patIdx) 584: end
# File lib/TextParser.rb, line 526 526: def getNextToken 527: token = nextToken 528: #Log << "Token: [#{token[0]}][#{token[1]}]" 529: if @badVariables.include?(token[0]) 530: error('unsupported_token', 531: "The token #{token[1]} is not supported in this context.", 532: token[2]) 533: end 534: token 535: end
getTransitions recursively determines all possible target tokens that the rule matches. A target token can either be a fixed token (prefixed with _), a variable token (prefixed with $) or an end token (just a .). The list of found target tokens is stored in the transitions list of the rule. For each rule pattern we store the transitions for this pattern in a token -> rule hash.
# File lib/TextParser.rb, line 227 227: def getTransitions(rule) 228: # If we have processed this rule before we can just return a copy 229: # of the transitions of this rule. This avoids endless recursions. 230: return rule.transitions.dup unless rule.transitions.empty? 231: 232: rule.transitions = [] 233: rule.patterns.each do |pat| 234: allTokensOptional = true 235: transitions = { } 236: pat.each do |token| 237: tokenId = token[1..1] 238: if token[0] == !! 239: unless @rules.has_key?(tokenId) 240: raise "Fatal Error: Unknown reference to #{tokenId} in pattern " + 241: "#{pat} + of rule #{rule.name}" 242: end 243: refRule = @rules[tokenId] 244: # If the referenced rule describes optional content, we need to look 245: # at the next token as well. 246: res = getTransitions(@rules[tokenId]) 247: allTokensOptional = false unless refRule.optional?(@rules) 248: # Combine the hashes for each pattern into a single hash 249: res.each do |pat_i| 250: pat_i.each { |tok, r| transitions[tok] = r } 251: end 252: elsif '_$.'.include?(token[0]) 253: transitions[token] = rule 254: allTokensOptional = false 255: else 256: raise 'Fatal Error: Illegal token type specifier used for token' + 257: ": #{tokenId}" 258: end 259: break unless allTokensOptional 260: end 261: # Make sure that we only have one possible transition for each 262: # target. 263: transitions.each do |key, value| 264: rule.transitions.each do |trans| 265: if trans.has_key?(key) 266: rule.dump 267: raise "Fatal Error: Rule #{rule.name} has ambiguous " + 268: "transitions for target #{key}" 269: end 270: end 271: end 272: rule.transitions << transitions 273: end 274: rule.transitions.dup 275: end
This function processes the input starting with the syntax description of rule. It’s implemented as an unrolled recursion. It recursively iterates over the rule tree as controlled by the input file. This version is not limited by the size of the system stack. So far, I’m not aware of any project that is too large for the system stack. Since the recursive version parseRuleR is about 8% faster and has cleaner code, we use that by default.
# File lib/TextParser.rb, line 387 387: def parseRuleNR(rule) 388: elementIdx = 0 389: recursionResult = nil 390: # These flags are used to managed the control flow to and from the 391: # recursion point. 392: recur = resume = false 393: # The stack that holds the context for the recursion levels. It's either 394: # just a rule to start a new recursion or an Array of state variables. 395: recursionStack = [ rule ] 396: begin 397: # Pop the top entry from the recursion stack. 398: se = recursionStack.pop 399: if se.is_a?(Array) 400: # We have essentially finished a recursion level and need to get 401: # back to the place where we started the recursion. First, we need 402: # to restore the state again. 403: rule, pattern, elementIdx, result, repeatMode, sfi = se 404: #Log << "Recursion loop started in resume mode for rule #{rule.name}" 405: # Now jump to the recursion point without doing anything else. 406: resume = true 407: else 408: # Start a new recursion level. The rule tells us how to interpret 409: # the input text. 410: rule = se 411: #Log.enter('parseRuleNR', "Parsing with rule #{rule.name}") 412: resume = false 413: end 414: 415: unless resume 416: result = rule.repeatable ? TextParserResultArray.new : nil 417: # Rules can be marked 'repeatable'. This flag will be set to true 418: # after the first iteration has been completed. 419: repeatMode = false 420: end 421: 422: loop do 423: unless resume 424: # At the beginning of a rule we need a token from the input to 425: # determine which pattern of the rule needs to be processed. 426: token = getNextToken 427: 428: break unless (pattern = findPattern(rule, token, repeatMode)) 429: # The @stack will store the resulting value of each element in the 430: # pattern. 431: @stack << TextParser::StackElement.new(pattern.function) 432: 433: # Once we've found the right pattern, we need to process each 434: # element. 435: elementIdx = 0 436: end 437: 438: elementCount = pattern.length 439: while elementIdx < elementCount 440: element = pattern[elementIdx] 441: # Separate the type and token text for pattern element. 442: elType = element[0] 443: elToken = element[1..1] 444: if elType == !! 445: unless resume 446: # The element is a reference to another rule. Return the token 447: # if we still have one and continue with the referenced rule. 448: if token 449: sfi = token[2] 450: returnToken(token) 451: token = nil 452: else 453: sfi = nil 454: end 455: # This is where the recursion would happen. Instead, we push 456: # the state variables and then the next rule onto the 457: # recursion stack. 458: recursionStack.push([ rule, pattern, elementIdx, result, 459: repeatMode, sfi ]) 460: recursionStack.push(@rules[elToken]) 461: # Now terminate all but the outer loops without doing anything 462: # else. 463: recur = true 464: break 465: else 466: # We're back right after where the recursion started. Store 467: # the result and turn resume mode off again. 468: @stack.last.store(recursionResult, sfi) 469: resume = false 470: end 471: else 472: # In case the element is a keyword or variable we have to get a 473: # new token if we don't have one anymore. 474: token = getNextToken unless token 475: 476: processNormalElements(elType, elToken, token) 477: 478: # The token has been consumed. Reset the variable. 479: token = nil 480: @@expectedTokens = [] 481: end 482: elementIdx += 1 483: end # of pattern while loop 484: 485: # Skip the rest of the loop in recur mode. 486: break if recur 487: 488: elementIdx = 0 489: 490: # Once the complete pattern has been processed we call the 491: # processing function for this pattern to operate on the value 492: # array. Then pop the entry for this rule from the stack. The 493: # called function will use @val and @sourceFileInfo to retrieve 494: # data from the parser. 495: @val = @stack.last.val 496: @sourceFileInfo = @stack.last.sourceFileInfo 497: res = @stack.last.function ? @stack.last.function.call : nil 498: @stack.pop 499: 500: # If the rule is not repeatable we can store the result and break 501: # the outer loop to exit the function. 502: unless rule.repeatable 503: result = res 504: break 505: end 506: 507: # Otherwise we append the result to the result array and turn repeat 508: # mode on. 509: result << res 510: # We have completed the first iteration. Set the repeat mode flag to 511: # indicate that further iterations are already re-runs. 512: repeatMode = true 513: end # of rule processing loop 514: 515: if recur 516: recur = false 517: else 518: #Log.exit('parseRuleNR', "Finished rule #{rule.name}") 519: recursionResult = result 520: end 521: end while !recursionStack.empty? 522: 523: return result 524: end
This function processes the input starting with the syntax description of rule. It recursively calls this function whenever the syntax description contains the reference to another rule. This recursive version has cleaner code and is about 8% faster than parseRuleNR.
# File lib/TextParser.rb, line 306 306: def parseRuleR(rule) 307: #Log.enter('parseRuleR', "Parsing with rule #{rule.name}") 308: #puts "Parsing with rule #{rule.name}" 309: result = rule.repeatable ? TextParserResultArray.new : nil 310: # Rules can be marked 'repeatable'. This flag will be set to true after 311: # the first iteration has been completed. 312: repeatMode = false 313: loop do 314: # At the beginning of a rule we need a token from the input to determine 315: # which pattern of the rule needs to be processed. 316: token = getNextToken 317: 318: return result unless (pattern = findPattern(rule, token, repeatMode)) 319: # The @stack will store the resulting value of each element in the 320: # pattern. 321: @stack << TextParser::StackElement.new(pattern.function) 322: 323: pattern.each do |element| 324: # Separate the type and token text for pattern element. 325: elType = element[0] 326: elToken = element[1..1] 327: if elType == !! 328: # The element is a reference to another rule. Return the token if 329: # we still have one and continue with the referenced rule. 330: unless token.nil? 331: sfi = token[2] 332: returnToken(token) 333: token = nil 334: else 335: sfi = nil 336: end 337: @stack.last.store(parseRuleR(@rules[elToken]), sfi) 338: #Log << "Resuming rule #{rule.name}" 339: #puts "Resuming rule #{rule.name}" 340: else 341: # In case the element is a keyword or variable we have to get a new 342: # token if we don't have one anymore. 343: token = getNextToken unless token 344: 345: processNormalElements(elType, elToken, token) 346: 347: # The token has been consumed. Reset the variable. 348: token = nil 349: @@expectedTokens = [] 350: end 351: end 352: 353: # Once the complete pattern has been processed we call the processing 354: # function for this pattern to operate on the value array. Then pop the 355: # entry for this rule from the stack. 356: @val = @stack.last.val 357: @sourceFileInfo = @stack.last.sourceFileInfo 358: res = nil 359: res = @stack.last.function.call unless @stack.last.function.nil? 360: @stack.pop 361: 362: # If the rule is not repeatable we can store the result and break the 363: # outer loop to exit the function. 364: unless rule.repeatable 365: result = res 366: break 367: end 368: 369: # Otherwise we append the result to the result array and turn repeat 370: # mode on. 371: result << res 372: repeatMode = true 373: end 374: 375: #Log.exit('parseRuleR', "Finished rule #{rule.name}") 376: #puts "Finished rule #{rule.name}" 377: return result 378: end
Handle the elements that don’t trigger a recursion.
# File lib/TextParser.rb, line 587 587: def processNormalElements(elType, elToken, token) 588: if elType == __ 589: # If the element requires a keyword the token must match this 590: # keyword. 591: if elToken != token[1] 592: text = "'#{elToken}' expected but found " + 593: "'#{token[1]}' (#{token[0]})." 594: unless @@expectedTokens.empty? 595: text = "#{@@expectedTokens.join(', ')} or " + text 596: end 597: error('spec_keywork_expctd', text, token[2]) 598: end 599: @stack.last.store(elToken, token[2]) 600: elsif elType == .. 601: if token[0..1] != [ '.', '<END>' ] 602: error('end_expected', 603: "Found garbage at expected end of text: #{token[1]}\n" + 604: "If you see this in the middle of your text, you probably " + 605: "have closed your context too early.", token[2]) 606: end 607: else 608: # The token must match the expected variable type. 609: if token[0] != elToken 610: text = "'#{elToken}' expected but found " + 611: "'#{token[1]}' (#{token[0]})." 612: unless @@expectedTokens.empty? 613: text = "#{@@expectedTokens.join(', ')} or " + text 614: end 615: error('spec_token_expctd', text, token[2]) 616: end 617: # If the element is a variable store the value of the token. 618: @stack.last.store(token[1], token[2]) 619: end 620: end
Disabled; run with --debug to generate this.
Generated with the Darkfish Rdoc Generator 1.1.6.