String

This is an extension and modification of the standard String class. We do a lot of UTF-8 character processing in the parser. Ruby 1.8 does not have good enough UTF-8 support, and Ruby 1.9 only handles UTF-8 characters as Strings, which is very inefficient compared to representing them as Fixnum objects. Some of these hacks can be removed once we support only Ruby 1.9.

Public Instance Methods

<<(obj)

Replacement for the existing << operator that also accepts Fixnum values above 255, which represent multi-byte UTF-8 characters.

    # File lib/UTF8String.rb, line 59
    def <<(obj)
      if obj.is_a?(String) || (obj < 256)
        # In this case we can use the built-in concat.
        concat(obj)
      else
        # UTF-8 characters have a maximum length of 4 bytes and no byte is 0.
        mask = 0xFF000000
        pos = 3
        while pos >= 0
          # Use the built-in concat operator for each byte.
          concat((obj & mask) >> (8 * pos)) if (obj & mask) != 0
          # Move mask and position to the next byte.
          mask = mask >> 8
          pos -= 1
        end
        # Return the receiver, as the built-in << does, so calls can be chained.
        self
      end
    end

Also aliased as: old_double_left_angle
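The byte-splitting loop above can be tried in isolation. The following is a minimal sketch, assuming (as the mask and shift logic suggests) that the integer holds the character's UTF-8 bytes packed with the first byte in the most significant position. The helper name `packed_utf8_bytes` is hypothetical and not part of UTF8String.rb.

```ruby
# Hypothetical helper: split an integer holding up to four packed UTF-8
# bytes into its non-zero bytes, most significant byte first.
def packed_utf8_bytes(value)
  bytes = []
  3.downto(0) do |pos|
    byte = (value >> (8 * pos)) & 0xFF
    # Valid multi-byte UTF-8 sequences never contain a zero byte, so a zero
    # here is unused padding in the upper part of the integer.
    bytes << byte unless byte == 0
  end
  bytes
end

# 0xE282AC is the three-byte UTF-8 encoding of U+20AC (euro sign).
euro_bytes = packed_utf8_bytes(0xE282AC)
euro_text = euro_bytes.pack('C*').force_encoding('UTF-8')
```

Collecting the bytes into an array first, instead of appending them to a String one at a time, keeps the sketch independent of how a given Ruby version treats an Integer argument to `concat`.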

each_utf8_char()

Iterate over the String, calling the block once for each UTF-8 character. This implementation looks more awkward but is noticeably faster than the commonly suggested regexp-based implementations.

    # File lib/UTF8String.rb, line 28
    def each_utf8_char
      c = ''
      length = 0
      each_byte do |b|
        c << b
        if length > 0
          # subsequent unicode byte
          if (length -= 1) == 0
            # end of unicode character reached
            yield c
            c = ''
          end
        elsif (b & 0xC0) == 0xC0
          # first unicode byte: count its leading 1 bits. Starting at -1
          # leaves length at the number of continuation bytes still to come.
          length = -1
          while (b & 0x80) != 0
            length += 1
            b = b << 1
          end
        else
          # ASCII character
          yield c
          c = ''
        end
      end
    end
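The core idea of the method above is that the number of leading 1 bits in the first byte gives the total length of the UTF-8 sequence. The following is a minimal stand-alone sketch of that idea, assuming well-formed UTF-8 input. The names `utf8_sequence_length` and `each_utf8_sequence` are hypothetical, and the sketch works on a byte array instead of patching String.

```ruby
# Hypothetical helper: total number of bytes in the UTF-8 sequence that
# starts with lead_byte, taken from the count of leading 1 bits.
def utf8_sequence_length(lead_byte)
  return 1 if (lead_byte & 0x80) == 0 # 0xxxxxxx: plain ASCII
  length = 0
  while (lead_byte & 0x80) != 0
    length += 1
    lead_byte = (lead_byte << 1) & 0xFF
  end
  length
end

# Hypothetical iterator: yields each character as its own UTF-8 String.
def each_utf8_sequence(str)
  bytes = str.bytes.to_a
  pos = 0
  while pos < bytes.length
    len = utf8_sequence_length(bytes[pos])
    yield bytes[pos, len].pack('C*').force_encoding('UTF-8')
    pos += len
  end
end

chars = []
each_utf8_sequence("a\u00E9\u20AC") { |c| chars << c }
```

Here `chars` ends up with three one-character strings, although the input is six bytes long.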

length_utf8()

Return the number of UTF-8 characters in the String. We don’t override the built-in length() method here, as we don’t know who else uses it or for what purpose.

    # File lib/UTF8String.rb, line 80
    def length_utf8
      len = 0
      each_utf8_char { |c| len += 1 }
      len
    end
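To illustrate the difference between a byte count and a UTF-8 character count, here is a minimal sketch that uses a different technique from the method above: it counts every byte that is not a continuation byte (bit pattern 10xxxxxx). The helper name `utf8_char_count` is hypothetical, and the sketch assumes well-formed UTF-8 input.

```ruby
# Hypothetical helper, not part of UTF8String.rb: every character
# contributes exactly one byte that is not of the form 10xxxxxx.
def utf8_char_count(str)
  str.bytes.count { |b| (b & 0xC0) != 0x80 }
end

text = "a\u00E9\u20AC"             # 'a', e-acute, euro sign
byte_count = text.bytesize         # 1 + 2 + 3 bytes
char_count = utf8_char_count(text) # one per character
```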

old_double_left_angle(obj)

Alias for: <<

old_reverse()

Alias for: reverse

reverse()

UTF-8 aware version of reverse that replaces the built-in one.

    # File lib/UTF8String.rb, line 89
    def reverse
      a = []
      each_utf8_char { |c| a << c }
      a.reverse.join
    end

Also aliased as: old_reverse
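A short sketch of why a UTF-8 aware reverse is needed: reversing the raw bytes scrambles multi-byte sequences, while reversing whole characters keeps them intact. This sketch assumes a Ruby version that provides String#b and String#each_char (2.0 or later) and does not use the methods documented here.

```ruby
text = "a\u00E9\u20AC" # 'a', e-acute, euro sign

# Byte-wise reverse: the bytes of each multi-byte character end up in the
# wrong order, so the result is no longer valid UTF-8.
bytewise = text.b.reverse.force_encoding('UTF-8')

# Character-wise reverse: collect whole characters, then reverse the list.
charwise = text.each_char.to_a.reverse.join

bytewise_valid = bytewise.valid_encoding?
```

The character-wise variant follows the same collect, reverse, join pattern as the method above.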

to_quoted_printable()

    # File lib/UTF8String.rb, line 100
    def to_quoted_printable
      # Encode as quoted-printable, then convert line endings to CRLF.
      [self].pack('M').gsub(/\n/, "\r\n")
    end
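The same conversion can be sketched as a stand-alone function, which avoids patching String. The name `quoted_printable_crlf` is hypothetical. `Array#pack` with the `'M'` directive is Ruby's built-in quoted-printable encoder: it escapes each non-ASCII byte as `=XX` and ends lines with a bare `\n`, which is why the extra substitution is needed.

```ruby
# Hypothetical stand-alone variant of to_quoted_printable.
def quoted_printable_crlf(text)
  # pack('M') ends lines with "\n"; rewrite them as CRLF.
  [text].pack('M').gsub(/\n/, "\r\n")
end

encoded = quoted_printable_crlf("caf\u00E9\n")
```

The e-acute occupies two bytes in UTF-8, so it appears in `encoded` as two `=XX` escapes.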
