## Features - `ZScan#scan`/`ZScan#skip`/`ZScan#match_bytesize` accept either string or regexp as param. - `ZScan#pos` is the codepoint position, and `ZScan#bytepos` is byte position. - Correctly scans anchors and look behind predicates. - Pos stack manipulation. - Typed scanning methods: `#scan_float`, `#scan_int radix=nil`, `#scan_date format`, `#scan_binary format`. ## Install ```bash gem ins zscan ``` ## Typical use ``` ruby require 'zscan' z = ZScan.new 'hello world' z.scan 'hello' #=> 'hello' z.skip ' ' z.scan /\w+/ #=> 'world' z.eos? #=> true ``` ## Motivation: string scanner Ruby's stdlib `StringScanner` treats the scanning position as beginning of string: ```ruby require 'strscan' s = StringScanner.new 'ab' s.pos = 1 s.scan /(? nil s.scan /^/ #=> '' ``` But for building parser generators, I need the scanner check the whole string for anchors and lookbehinds: ```ruby require 'zscan' z = ZScan.new 'ab' z.pos = 1 z.scan /(? '' z.scan /^/ #=> nil ``` See also https://bugs.ruby-lang.org/issues/7092 ## Other motivations - For scan and convert, ruby's stdlib `Scanf` is slow (creates regexp array everytime called) and not possible to corporate with scanner. - For date parsing, `strptime` doesn't tell the parsed length. - For binary parsing, `unpack` is an slow interpreter, and the instructions are quite irregular. ## Essential methods - `ZScan.new string, dup=false` - `#scan regexp_or_string` - `#skip regexp_or_string` - `#match_bytesize regexp_or_string` return length of matched bytes or `nil`. - `#scan_float` scan a float number which is not starting with space. It deals with multibyte encodings for you. - `#scan_int radix=nil` if radix is nil, decide base by prefix: `0x` is 16, `0` is 8, `0b` is 2, otherwise 10. `radix` should be in range `2..36`. - `#scan_date format_string, start=Date::ITALY` scan a `DateTime` object, see also [strptime](http://rubydoc.info/stdlib/date/DateTime.strptime). - `#scan_binary binary_spec` optimized and readable binary scan, see below for how to create a `ZScan::BinarySpec`. - `#unpack format_string` - `#eos?` - `#string` note: return a dup. Don't worry the performance because it is a copy-on-write string. - `#rest` ## String delegates For convienience - `#<< append_string` - `#[]= range, replace_string` note: if `range` starts before pos, moves pos left, also clears the stack. - `#size` - `#bytesize` ## Pos management - `#pos` - `#pos= new_pos` note: complexity ~ `new_pos > pos ? new_pos - pos : new_pos`. - `#bytepos` - `#bytepos= new_bytepos` note: complexity ~ `abs(new_bytepos - bytepos)`. - `#line_index` line index of current position, start from `0`. - `#advance n` move forward `n` codepoints, if `n < 0`, move backward. Stops at beginning or end. - `#reset` go to beginning. - `#terminate` go to end of string. ## Binary parsing Specify a sequence of binary data. Designed for binary protocol parsing. Example: ```ruby # create a ZScan::BinarySpec s = ZScan.binary_spec do int8 # once uint32_le 2 # little endian, twice double_be 1 # big endian, once end z = ZScan.new [-1, 2, 3, 4.0].pack('cI<2G') + "rest" z.scan_binary s #=> [-1, 2, 3, 4.0] z.rest #=> 'rest ``` Integer instructions: ```ruby int8 uint8 int16 uint16 int16_le uint16_le int16_be uint16_be int32 uint32 int32_le uint32_le int32_be uint32_be int64 uint64 int64_le uint64_le int64_be uint64_be ``` Single precision float instructions: ```ruby single single_le single_be ``` Double precision float instructions: ```ruby double double_le double_be ``` Endians: - (without endian suffix) native endian - `*_le` little endian (VAX, x86, Windows string code unit) - `*_be` big endian, network endian (SPARC, Java string code unit) Repeat count must be integer `>= 1`, default is `1`. It is implemented as a direct-threaded bytecode interpreter. A bit faster than `String#unpack`. ## Parsing combinators Combinators that manage scanner pos and stack state for you. In the combinators, if the returned value of the given block is `nil` or `false`, stops iteration and restores scanner location. Can be nested, useful for building parsers. - `#try &block` returns `block`'s return. - `#zero_or_one acc=[], &block` try to execute 0 or 1 time, returns `acc`. - `#zero_or_more acc=[], &block` try to execute 0 or more times, also stops iteration if scanner no advance, returns `acc`. - `#one_or_more acc=[], &block` try to execute 1 or more times, also stops iteration if scanner no advance, returns `nil` or `acc`. ## (Low level) Efficient pos stack manipulation - `#push` push current pos into the stack. - `#pop` set current pos to top of the stack, and pop it. - `#drop` drop top of pos stack without changing current pos. - `#restore` set current pos to top of the stack. - `#clear_pos_stack` clear pos stack. - `z.push._try expr` equivalent to `z.try{ expr }`, but faster because no block is required ## License ``` Copyright (C) 2013 by Zete Lui (BSD) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN ```