# # xampl-pp : XML pull parser # Copyright (C) 2002-2009 Bob Hutchison # # This library is free software; you can redistribute it and/or # modify it under the terms of the GNU Lesser General Public # License as published by the Free Software Foundation; either # version 2.1 of the License, or (at your option) any later version. # # This library is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU # #Lesser General Public License for more details. # # You should have received a copy of the GNU Lesser General Public # License along with this library; if not, write to the Free Software # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA # require "xampl-pp" ## ## It may seem strange, but it seems that a good way to demonstrate the use ## of the xampl-pp pull parser is to show how to build a SAX-like XML ## parser. Both pull parsers and SAX parsers are stream based -- they parse ## the XML file bit by bit informing its client of interesting events as ## they are encountered. The whole XML document is not required to be in ## memory. The significant difference between pull parsers and SAX parsers ## is in where the 'main loop' is located: in the client for pull parsers, ## in the parser for SAX parsers. Clients call a method of the pull parser ## to get the next event. SAX parsers call methods of the client to notify ## it of events (so these are 'push parsers'). ## ## It turns out to be quite easy to build a SAX-like parser from a pull ## parser. It is quite a lot harder to build a pull parser from a SAX-like ## parser. ## ## This class demonstrates (most) of the xampl-pp interface by implementing a ## SAX-like parser. No attempt has been made to provide all the functionality ## provided by a good Java SAX parser, though the equivalent of a significant, ## and useful, subset is implemented. ## ## The program text is annotated. Note, that the annotations generally ## follow the code being described. ## class SAXish ## ## The Ruby implementation of the xampl-pp parser is called Xampl_PP, and ## SAXish will be the name of our SAX-like parser. ## attr :handler, true ## ## Sax parsers need an event handler. 'handler' is it. Handler is expected to ## implement the methods defined in the module 'saxishHandler'. SaxishHandler ## is intended to be an adapter (so you can include it in any hander you ## write), so only the event-handlers for those events in which you are ## interested in need to be re-defined. SAXdemo is an implementation of ## SaxishHandler that gathers some statistics. ## ## Xampl-pp requires something it calls a resolver. This is a class that ## implements a method called resolve. There are a number of predefined ## entities in xampl-pp: & ' > < and ". It is possible ## to add more entities by adding entries to the entityMap hashtable. If an ## entity is encountered that is not in entityMap then the resolve method on ## the resolver is called. The default resolver returns nil, which causes ## an exception to be thrown. If you specify your own resolver you can do ## anything you like to obtain a value for the entity, or you can return nil ## (and an exception will be thrown). Xampl-pp, by default, is its own ## resolver and simply return nil. ## ## We are going to require that our saxish handler also be the entity ## resolver. This is reflected in the SaxHandler module, which implements ## a resolve method that always returns nil. ## attr :processNamespace, true attr :reportNamespaceAttributes, true ## ## This block of comments can be ignored, certainly for the first reading. ## It talks about some control you have over how the xampl-pp works. The ## default behaviour is the most commonly used. ## ## There are two main controls used here: processNamespace, and ## reportNamespaceAttributes. If processNamespaces is true, then namespaces ## in the XML file being parsed will be processed. Processing means that if ## an element is encountered, then four variables will be ## set up in the parser instance: name is 'name', prefix is 'prefix', ## qname is 'prefix:name', and namespace is defined. If the namespace cannot ## be defined an exception is thrown. In addition the xmlns attributes ## are processed. If processNamespace is false then name and qname ## will both be 'prefix:name', and both prefix and namespace undefined. ## If reportNamespaceAttributes is true then the xmlns attributes will be ## reported along with all the other attributes, if false then they will ## be hidden. The default behaviour is to process namespaces but to not ## report the namespace attributes. ## ## There are two other controls that should be mentioned. They are not ## used here. ## ## Pull parsers are pretty low level tools. They are meant to be fast. While ## may wellformedness constraints are enforced, not all are. If the control ## checkWellFormed is true then additional checks are made. Xampl-pp does ## not guarantee that it will parse only well formed XML documents. It ## will parse some XML files that are not well formed without objecting. In ## future releases, it will be possible to have xampl-pp accept only ## well formed documents. If checkWellFormed is false, then the parser ## doesn't go out of its way to notice ill formed documents. The default ## is true. ## ## The fourth control is 'utf8encode'. If this is true, and it defaults to ## true, then an entity like Ӓ is encountered then it will be encoded ## using utf8 rules. Given the current state of the parser, it would be best ## to leave it set to true. If you want to change this then you must either ## never use &#; encodings with numbers greater than 255 (Ruby will throw an ## exception), or you must redefine xampl-pp's encode method to do the right ## thing. ## def parse(filename) @xpp = Xampl_PP.new @xpp.input = File.new(filename) @xpp.processNamespace = @processNamespace @xpp.reportNamespaceAttributes = @reportNamespaceAttributes @xpp.resolver = @handler work end def parseString(string) @xpp = Xampl_PP.new @xpp.input = string @xpp.processNamespace = @processNamespace @xpp.reportNamespaceAttributes = @reportNamespaceAttributes @xpp.resolver = @handler work end # # Constructing an instance of xampl-pp is pretty straight forward: Xampl_PP.new # # Xampl_PP accepts two kinds of input: IO and String. The same method, # 'input', is used to specify the input. It is possible to set the input # anytime, but if you do, the current input will be closed if it is of # type IO, and the parsing will begin at the current location of the input. # # The methods parse and parseString illustrate. # def work while not @xpp.endDocument? do case @xpp.nextEvent when Xampl_PP::START_DOCUMENT @handler.startDocument when Xampl_PP::END_DOCUMENT @handler.endDocument when Xampl_PP::START_ELEMENT @handler.startElement(@xpp.name, @xpp.namespace, @xpp.qname, @xpp.prefix, attributeCount, @xpp.emptyElement, self) when Xampl_PP::END_ELEMENT @handler.endElement(@xpp.name, @xpp.namespace, @xpp.qname, @xpp.prefix) when Xampl_PP::TEXT @handler.text(@xpp.text, @xpp.whitespace?) when Xampl_PP::CDATA_SECTION @handler.cdataSection(@xpp.text) when Xampl_PP::ENTITY_REF @handler.entityRef(@xpp.name, @xpp.text) when Xampl_PP::IGNORABLE_WHITESPACE @handler.ignoreableWhitespace(@xpp.text) when Xampl_PP::PROCESSING_INSTRUCTION @handler.processingInstruction(@xpp.text) when Xampl_PP::COMMENT @handler.comment(@xpp.text) when Xampl_PP::DOCTYPE @handler.doctype(@xpp.text) end end end def attributeCount return @xpp.attributeName.length end def attributeName(i) return @xpp.attributeName[i] end def attributeNamespace(i) return @xpp.attributeNamespace[i] end def attributeQName(i) return @xpp.attributeQName[i] end def attributePrefix(i) return @xpp.attributePrefix[i] end def attributeValue(i) return @xpp.attributeValue[i] end def depth return @xpp.depth end def line return @xpp.line end def column return @xpp.column end ## ## There is one method used to parse the XML document: nextEvent. It returns ## the type of the event (described below). There are corresponding queries ## defined for each event type. The event is described by variables in the ## xampl-pp instance. ## ## It is possible to obtain the depth in the XML file (i.e. who many elements ## are currently open) using the xampl-pp method 'depth'. This is made ## available to the saxish client using a method on the sishax parser with the ## same name. ## ## The line and column number of the next unparsed character is available ## using the line and column methods. Note that line is always 1 for ## string input. ## ## There is a method, whitespace?, that will tell you if the current text ## value is whitespace. ## ## The event types are: ## ## START_DOCUMENT, END_DOCUMENT -- informational ## ## START_ELEMENT -- on this event several features are defined in the parser ## that are pertinent. name, namespace, qname, prefix describe the element ## tag name. emptyElement is true if the element is of the form , ## false otherwise. And the arrays attributeName, attributeNamespace, ## attributeQName, attributePrefix, and attributeValue contain attribute ## information. The number of attributes is obtained from the length of ## any of these arrays. Attribute information is presented to the sax ## client using six methods: attributeCount, attributeName(i), ## attributeNamespace(i), attributeQName(i), attributePrefix(i), ## attributeValue(i). ## ## END_ELEMENT -- name, namespace, qname, and prefix are defined. NOTE that ## emptyElement will always be false for this event, even though it is called ## for elements of the form . ## ## TEXT -- upon plain text found in an element. Note that it is ## quite possible that several text events in succession may be made for a ## single run of text in the XML file ## ## CDATA_SECTION -- upon a CDATA section. Note that it is quite possible ## that several CDATA events in succession may be made for a single CDATA ## section. ## ## ENTITY_REF -- for each entity encountered. It will have the ## value in the text field, and the name in the name field. ## ## IGNORABLE_WHITESPACE -- for whitespace that occurs at the document ## level of the XML file (i.e. outside the root element). This whitespace is ## meaningless in XML and so can be ignored (and so the name). If you are ## interested in it, the whitespace is in the text field. ## ## PROCESSING_INSTRUCTION -- upon a processing instruction. The content of ## the processing instruction (with the removed) is provied in ## the text field. ## ## COMMENT -- upon a comment. The content of the comment (with the removed) is provied in the text field. ## ## DOCTYPE -- upon encountering a doctype. The content of the doctype ## (with the removed) is provided in the text field. ## ## The event query methods are: cdata?, comment?, doctype?, endDocument?, ## endElement?, entityRef?, ignorableWhitespace?, processingInstruction?, ## startDocument?, startElement?, and text? ## end