This document describes how each parser included in xmlscan conforms to the XML-related specifications.
XMLscan is a "non-validating XML processor" as defined by the XML 1.0 Specification [XML]. XMLscan satisfies almost all of the conditions required of a non-validating XML processor, but because of implementation limitations the following main restrictions apply. For details, see the descriptions of each class below.
XMLScan::XMLScanner tokenizes an XML document and recognizes only the following constructs: XML declaration, document type declaration, processing instruction, comment, start tag, end tag, empty element tag, CDATA section, general entity reference, and character reference. It is NOT an error for one of these constructs to appear in a context where it is prohibited, except in the cases described below.
A parse error is reported when an XML declaration, document type declaration (except the internal DTD subset), processing instruction, comment, start tag, end tag, empty element tag, CDATA section, general entity reference, or character reference does not match its production as defined in the XML 1.0 Specification [XML].
For reasonable speed, unless the `strict_char' option is specified, XMLScan::XMLScanner doesn't check whether a name or character data contains characters that are illegal for it. All characters are allowed except those recognized as delimiters in that context. To be more precise, without the `strict_char' option, the productions Char[2], Name[5], Nmtoken[7], EntityValue[9], AttValue[10], SystemLiteral[11], PubidChar[13], CharData[14], VersionNum[26], and EncName[81] are not checked strictly.
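To illustrate what a strict check of the Char[2] production involves, here is a minimal sketch in plain Ruby. It does not use xmlscan and is not how `strict_char' is implemented; the names XML_CHAR_ONLY and legal_xml_chars? are hypothetical, and it assumes Ruby 2.4 or later with UTF-8 input.

```ruby
# Hypothetical helper, not part of xmlscan. Tests a UTF-8 string against
# the Char[2] production of XML 1.0:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_CHAR_ONLY =
  /\A[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]*\z/

def legal_xml_chars?(text)
  # A string with invalid byte sequences cannot consist of legal characters.
  text.valid_encoding? && XML_CHAR_ONLY.match?(text)
end
```

A check like this has to look at every character of the input, which is the cost that is avoided when the option is left off.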
XMLScan::XMLScanner doesn't normalize linebreaks.
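If normalized line breaks are needed, the input can be preprocessed before scanning. The following is a minimal sketch in plain Ruby of the end-of-line handling described in section 2.11 of the XML 1.0 Specification; normalize_linebreaks is a hypothetical name and is not part of xmlscan.

```ruby
# Hypothetical helper, not part of xmlscan. Translates every CR LF pair
# and every CR not followed by LF into a single LF.
def normalize_linebreaks(text)
  text.gsub(/\r\n?/, "\n")
end
```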
Since Ruby does not support UTF-16, an XML document encoded in UTF-16 cannot be parsed as it is. You need to convert it to UTF-8 before parsing.
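One way to do the conversion is sketched below. This is only an illustration in plain Ruby and assumes Ruby 2.0 or later, where String#b, String#byteslice, and String#encode are available. The name utf16_to_utf8 is hypothetical, and treating input without a byte order mark as big-endian is an assumption of this sketch.

```ruby
# Hypothetical helper, not part of xmlscan. Converts raw UTF-16 bytes
# to a UTF-8 string, using the byte order mark to choose the byte order.
def utf16_to_utf8(bytes)
  data = bytes.b # binary copy; the argument itself is not modified
  if data.start_with?("\xFF\xFE".b)
    source = Encoding::UTF_16LE
    data = data.byteslice(2, data.bytesize - 2) # drop the byte order mark
  elsif data.start_with?("\xFE\xFF".b)
    source = Encoding::UTF_16BE
    data = data.byteslice(2, data.bytesize - 2) # drop the byte order mark
  else
    source = Encoding::UTF_16BE # assumption: no byte order mark means big-endian
  end
  data.force_encoding(source).encode(Encoding::UTF_8)
end
```

The sketch converts the bytes only. An encoding declaration inside the document is left as it was written.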
`<?xml' appearing anywhere other than the beginning of an XML document is treated as a processing instruction.
It is not checked whether the value of a standalone document declaration is either "yes" or "no".
It is not checked whether the target of a processing instruction is a reserved target such as "xml".
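An application that needs the checks on the standalone value and on reserved processing instruction targets can apply them itself to the values it receives. The following is a minimal sketch in plain Ruby, assuming Ruby 2.4 or later; valid_standalone? and reserved_pi_target? are hypothetical names and are not part of xmlscan.

```ruby
# Hypothetical helpers, not part of xmlscan.

# XML 1.0 allows only "yes" or "no" as the standalone value (SDDecl[32]).
def valid_standalone?(value)
  value == "yes" || value == "no"
end

# XML 1.0 excludes "xml" in any mix of upper and lower case from the
# allowed processing instruction targets (PITarget[17]).
def reserved_pi_target?(target)
  /\A[Xx][Mm][Ll]\z/.match?(target)
end
```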
A parse error is reported when a document type declaration appears anywhere other than the prolog, or when two or more document type declarations are found in one document.
A well-formedness constraint violation is reported when `<' appears directly in an attribute value. If the `strict_char' option is specified, XMLScan::XMLScanner checks the well-formedness constraint: Legal Character. No other well-formedness constraints are checked.
XMLScan::XMLScanner skips an internal DTD subset.
The goal of XMLScan::XMLParser is to satisfy almost all of the conditions required of a non-validating XML parser.
The descriptions given for XMLScan::XMLScanner about the `strict_char' option and about UTF-16 also apply to XMLScan::XMLParser. The following well-formedness constraints about a character reference are checked only if the `strict_char' option is specified:
As described for XMLScan::XMLScanner, linebreaks are not normalized.
XMLScan::XMLParser skips an internal DTD subset. The following well-formedness constraints about an internal DTD subset are not checked:
All general entity references except those to the predefined entities (lt, gt, amp, quot, apos) are reported as references to undeclared entities.
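The distinction between predefined and undeclared entities can be sketched in plain Ruby as follows. This is only an illustration and does not show how XMLScan::XMLParser reports such references; PREDEFINED_ENTITIES and expand_predefined are hypothetical names.

```ruby
# Hypothetical names, not part of xmlscan.
# The five entities predefined by XML 1.0 and their replacement characters.
PREDEFINED_ENTITIES = {
  "lt" => "<", "gt" => ">", "amp" => "&", "quot" => '"', "apos" => "'"
}.freeze

# Expands references to predefined entities in one pass and collects the
# names of all other general entity references, which are left in place.
# Character references such as &#60; are not touched.
def expand_predefined(text)
  undeclared = []
  expanded = text.gsub(/&([^&;#\s]+);/) do
    whole = $&
    name = $1
    PREDEFINED_ENTITIES.fetch(name) do
      undeclared << name
      whole
    end
  end
  [expanded, undeclared]
end
```

For example, expand_predefined("a &lt; b &foo;") returns the text "a < b &foo;" together with the list ["foo"].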
External DTD subsets are not read. The following well-formedness constraints about an external DTD subset are not checked:
Since XMLScan::XMLParser cannot check whether the replacement text of an undeclared entity includes `<', the following well-formedness constraints are not checked completely:
XMLScan::XMLNamespace checks all constraints specified in ``Namespaces in XML'' and its errata [Namespaces], and ensures that an XML document is namespace-well-formed.
All limitations of XMLScan::XMLParser are inherited by XMLScan::XMLNamespace.